ABSTRACT


Numeric  Function  Generators  (NFGs)  have  allowed  computation  of  difficult 
mathematical  functions  in  less  time  and  with  less  hardware  than  commonly  employed 
methods.  They  compute  piecewise  linear  (or  quadratic)  approximations  that  represent  the 
value  of  the  original  function  for  a  given  input  value.  The  domain  of  the  NFG  is  divided 
into  enough  segments  such  that  the  approximation  is  within  the  required  error  to  the 
actual  value  of  the  function.  The  linear  (or  quadratic)  approximation  varies  for  each 
segment.  The  overall  hardware  complexity  and  propagation  delay  depend  on  the  number 
of  segments  required,  the  arithmetic  devices  used  to  approximate  the  function,  and  the 
number  of  bits  used  to  represent  the  numbers  being  calculated. 

This  thesis  develops  an  accurate  method  to  quantify  hardware  utilization  and 
propagation  delay  for  various  NFG  configurations  implemented  on  Field-Programmable 
Gate  Arrays  (FPGAs).  The  algorithms  and  estimation  techniques  apply  to  different  NFG 
architectures  and  to  different  mathematical  functions.  This  thesis  compares  hardware 
utilization  and  propagation  delay  for  various  NFG  architectures,  mathematical  functions, 
word  widths,  and  segmentation  methods.  It  shows  when  a  quadratic  NFG  requires  less 
hardware  and  when  it  has  a  longer  delay  than  its  linear  NFG  counterpart  for  various 
functions.  It  also  establishes  a  criterion  for  when  non-uniform  segmentation  is  beneficial 
for  any  function,  based  on  the  size  of  the  NFG.  The  findings  in  this  thesis  show  that 
NFGs  with  non-uniform  segmentation  generally  require  more  hardware  and  almost 
always have longer delays than NFGs with uniform segmentation. They also show that
quadratic NFGs require less hardware and have shorter delays than linear NFGs as the
size of the NFG gets larger.
EXECUTIVE SUMMARY


This  thesis  describes  a  complexity/delay  analysis  of  numeric  function  generators 
(NFGs) used in high-speed circuits for realizing arithmetic functions like $f(x) = \sin(x)$,
$f(x) = \ln x$, $f(x) = \sqrt{-\ln x}$, etc. Specifically, it shows how complexities and delays for
NFGs  can  be  estimated  without  having  to  build  the  circuit.  It  begins  by  constructing 
basic  arithmetic  components  that  are  often  used  in  NFGs.  Each  component  is  analyzed  in 
depth  to  estimate  its  complexity  and  delay  based  on  the  number  of  input  bits,  n.  Models 
of  common  NFGs  are  built  realizing  an  approximation  equation,  y(x) .  The  models  are 
used  to  compare  various  NFG  architectures  for  particular  functions.  NFGs  with  linear 
approximation  equations  are  compared  to  NFGs  with  quadratic  approximations. 

Uniform and non-uniform segmentation methods are also compared in this thesis
because the complexity and delay of an NFG depend greatly on the complexity and
delay of its coefficients table and associated segment index encoder (SIE). Uniform
segmentation divides the function interval into segments of even width, while
non-uniform segmentation divides the interval into $s_{non\text{-}unif}$ segments of varying
widths. The maximum segment width is determined by a maximum allowable error $\varepsilon$,
where $\varepsilon = |f(x) - y(x)|$. Non-uniform NFGs always require fewer segments than uniform
NFGs, but they also require an SIE in order to determine within which segment $x$ lies.

For 13 of the 15 functions analyzed in this thesis, non-uniform segmentation
offers no benefits. However, when non-uniform segmentation drastically reduces the
number of segments in an NFG, it can reduce the overall hardware complexity. This
occurs in the remaining 2 functions. The amount of reduction from uniform to
non-uniform segmentation can be expressed as a ratio, namely the segment reduction ratio
(SRR). The critical SRR at which non-uniform segmentation becomes beneficial is
$SRR_{crit}$. $SRR_{crit}$ depends on the number of segments, $s$, which depends on $\varepsilon$
and the properties and domain of the function being realized. This thesis also shows that
the SRR of a given function depends only on the properties of that function and its
domain. Thus, for a given function $f(x)$, when $SRR_{f(x)} < SRR_{crit}(n,s)$, then an NFG with

non-uniform segmentation requires less hardware than the same NFG with uniform
segmentation. When the number of segments (corresponding to the number of memory
locations) is restricted to a power of two, the number of segments for non-uniform
segmentation is $s_{non} = 2^{\lceil \log_2 s_{\min} \rceil}$, the number of segments for uniform
segmentation is $s_{unif} = 2^{\lceil \log_2 s_{\min} \rceil}$ (with $s_{\min}$ taken for the respective
segmentation method), and $SRR_{crit,\min}$ becomes a function only of $n$.


Therefore, for a basic linear NFG, if

$$\frac{\int_a^b \sqrt{|f^{(2)}(x)|}\,dx}{(b-a)\sqrt{\max|f^{(2)}(x)|}} < \frac{4}{n+4}$$

(that is, below $SRR_{crit,\min}$ for the Basic Linear architecture), then non-uniform segmentation yields a smaller amount of


hardware. This is true for basic quadratic NFGs when an analogous condition [...] holds. From
these equations, a critical value of n can be determined, $n_{crit}$, below which it is always
more  hardware  efficient  to  use  non-uniform  segmentation.  The  derivations  of  these 
equations  assume  that  LUT  cascades  are  used  in  the  SIE  and  Chebyshev  polynomials  are 
used to determine the coefficients for the approximation equations. They also assume
that  basic  NFG  architectures  are  used.  The  term  “basic”  refers  to  an  architecture  that 
does  not  truncate  bits  during  its  arithmetic  operations. 


This thesis shows that non-uniform segmentation always has a longer delay than
uniform segmentation, except in rare trivial NFGs (where $n < 8$). In fact, when NFG
architectures for 15 functions were compared in terms of delay, non-uniform NFGs
proved the best only in a few cases when $n < 2$. If $n < 2$, then an NFG is not required
since two LUTs can be used instead. Appendices D.2.2 and D.3.2 show the best
architectures based on delay for 15 functions.


Linear  and  quadratic  NFGs  are  also  compared  in  this  thesis.  Estimation  results 
show that linear NFGs consume less hardware than quadratic NFGs for n less than approximately 25
to 29 bits (for the 15 functions compared). They also have smaller delays than quadratic
NFGs for n less than approximately 37 to 39 bits. This thesis shows which of the four basic architectures
(linear uniform (LUB), linear non-uniform (LNB), quadratic uniform (QUB), quadratic
non-uniform (QNB)) is best in terms of hardware utilization and delay for all 15 functions
analyzed.  It  also  shows  the  best  of  four  compact  NFG  architectures  (LUC,  LNC,  QUC, 
and  QNC).  The  compact  architectures  are  similar  to  the  basic  architectures  except  they 
require  smaller  arithmetic  units. 
I. INTRODUCTION


A.  PROBLEM  DEFINITION 

Computer  calculations  of  numerical  functions  are  required  in  many  applications 
ranging  from  computer  graphics  to  robotics  to  radar  return  processing  [11]. 
Trigonometric,  logarithmic,  exponential,  and  power  functions  are  all  widely  used,  as  well 
as combinations of them. Well-designed application-specific integrated circuits (ASICs)
generally offer the fastest computation time for a specific function because they are
designed with that function in mind. However, they are usually expensive because they
are not in high demand, and they typically serve only one purpose.
Reconfigurable  computers  are  an  important  developing  technology  that  can  be  used  to 
perform  specific  computations.  They  provide  a  universal  platform  for  a  wide  variety  of 
tasks  and  allow  the  task  to  be  changed.  Reconfigurable  computers  often  use  Field- 
Programmable  Gate  Arrays  (FPGAs)  to  implement  the  desired  logic  designs.  The  benefit 
of  using  FPGAs  for  complex  computations  is  that  the  FPGA  can  perform  the 
computations while the processor performs other system-related tasks. Having the FPGA
compute the desired function is generally faster than having the main microprocessor do
the same computation. Because the main processor is free to perform other system tasks
in the meantime, the entire computer system becomes faster.

This  thesis  analyzes  methods  for  approximating  numerical  functions.  It  also 
discusses  the  implementations  on  FPGAs,  so  problem  solutions  must  be  able  to  fit  on  a 
particular  FPGA  while  still  meeting  the  speed  and  precision  requirements  of  the 
application  requiring  the  function  computation.  This  section  discusses  some  of  the 
hardware  configurations  that  are  currently  employed  in  performing  these  calculations, 
including  using  numeric  function  generators  (NFGs). 

1.  Methods  for  Numeric  Function  Computation 

There  are  several  methods  for  computing  real  functions  with  electronic  hardware. 
The  following  methods  are  commonly  employed. 
a. Lookup Table

A simple method for computing a numerical function is by using a lookup
table (LUT). LUTs use input variable x as the address to a memory block. The data
word stored at that address is the function’s value $f(x)$. This method requires an
enormous amount of memory for any relatively large computing system. Consider a
simple architecture where x has 16 bits and the result has 16 bits. The LUT requires
$2^{16} \times 16 = 2^{20} = 1{,}048{,}576$ memory bits, or 131,072 bytes. This is a relatively large amount
for such a small number system, making it very difficult to implement on FPGAs.
Modern computer systems require n to be much larger, generally 32 or 64 bits. A 32-bit
LUT requires over 17 Gbytes, and a 64-bit LUT requires about $1.5 \times 10^{20}$ bytes. Because of the
size requirements, LUTs are generally not the best solution for reconfigurable computers
because they do not fit on commonly used FPGAs.
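
The memory figures above follow from one output word per possible input value. The short Python sketch below reproduces that arithmetic; the function name `lut_size_bits` is introduced here purely for illustration and does not come from the thesis.

```python
def lut_size_bits(input_bits: int, output_bits: int) -> int:
    """Memory bits needed by a direct lookup table: one output word per input value."""
    return (2 ** input_bits) * output_bits


for n in (16, 32, 64):
    bits = lut_size_bits(n, n)
    # Integer division by 8 converts memory bits to bytes.
    print(f"n = {n:2d}: {bits} bits = {bits // 8} bytes")
```

The table size grows as $n \cdot 2^n$, which is why doubling the word width from 16 to 32 bits moves the requirement from kilobytes to gigabytes.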

b.  CORDIC 

Coordinate Rotation Digital Computer (CORDIC) algorithms are often
used because they require a small amount of hardware [1] [11]. They are used in many
pocket calculators and floating-point coprocessors [6].

CORDIC devices perform successive arithmetic operations iteratively.
Each of the iterations increases the precision of the result. Modern technology requires
high accuracy in very little time. Since the precision of CORDIC algorithms is
proportional to the computation time, they are becoming less acceptable [16] for
high-speed applications. In addition, CORDIC algorithms have been developed only for a
limited set of functions.
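
To illustrate how CORDIC precision is tied to the iteration count, the following is a minimal floating-point sketch of the textbook rotation-mode CORDIC for sine and cosine. It is not taken from the thesis; the function name and the use of floating-point arithmetic (real hardware would use fixed-point shifts) are assumptions made for readability.

```python
import math


def cordic_sin_cos(theta: float, iterations: int) -> tuple:
    """Rotation-mode CORDIC for |theta| <= pi/2; returns (sin, cos) estimates."""
    # Elementary rotation angles atan(2^-k), normally stored in a small table.
    angles = [math.atan(2.0 ** -k) for k in range(iterations)]
    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2k);
    # starting from the inverse of the total gain compensates for that.
    gain = 1.0
    for k in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * k))
    x, y, z = gain, 0.0, theta
    for k in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0          # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        z -= d * angles[k]
    return y, x


for n_iter in (8, 16, 24):
    s, _ = cordic_sin_cos(0.6, n_iter)
    print(n_iter, abs(s - math.sin(0.6)))
```

Each iteration contributes roughly one more bit of angular resolution, so the latency of an iterative CORDIC unit grows with the required word width.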

c.  Power  Series 

Some numerical functions can be decomposed into an infinite series
known as a power series. The power series is an infinite sum of powers of an input
variable x, or $f(x) = \sum_{i=0}^{\infty} a_i (x-c)^i = a_0 + a_1(x-c) + a_2(x-c)^2 + \ldots$. When $c = 0$, this
architecture can be implemented compactly in an iterative form, requiring a multiplier, an
adder, a register, and memory storage for the coefficients $a_i$. Like the CORDIC
algorithm, the accuracy of the result depends on the number of iterations of the algorithm
and it can be applied only to a limited number of functions. For example, $f(x) = e^x$ can
be represented by the power series $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$, but
more complex functions might not be able to be computed.
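
One way to picture the iterative form with a single multiplier, adder, register and coefficient memory is Horner's rule. The sketch below applies it to the truncated series for $e^x$; the choice of Horner's rule and the name `exp_series` are assumptions for illustration, since the text does not name a specific evaluation scheme.

```python
import math


def exp_series(x: float, terms: int) -> float:
    """Evaluate sum_{i < terms} x^i / i! with one multiply and one add per step."""
    coeffs = [1.0 / math.factorial(i) for i in range(terms)]  # coefficient memory a_i
    acc = 0.0                                                 # accumulator register
    for a in reversed(coeffs):
        acc = acc * x + a                                     # Horner step
    return acc


for terms in (4, 8, 16):
    print(terms, abs(exp_series(1.0, terms) - math.e))
```

As with CORDIC, the error shrinks only as more iterations (terms) are spent, which is the latency cost the text refers to.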


d.  Shift  and  Add  Algorithms 

Shift  and  add  algorithms,  such  as  the  BKM  algorithm  [6]  (named  for  its 
developers  J.C.  Bajard,  S.  Kla,  and  J.M.  Muller),  have  been  developed  to  compute 
functions without using multipliers. They simply iterate shifts and adds, thus reducing the
hardware  significantly.  BKM  algorithms  compute  a  limited  number  of  functions, 
including:  2-D  vector  rotations,  logarithmic  functions,  exponential  functions,  sine  and 
cosine  functions  and  arctan  functions  [6].  However,  their  precision  still  depends  on  the 
number  of  iterations  in  the  computation;  therefore  they  often  do  not  meet  the 
requirements  of  high-speed  applications. 


e. NFGs


NFGs return a function value by using piecewise approximations. NFGs
require a few basic arithmetic devices and a coefficient memory or LUT. The memory
size generally depends on the function being implemented and the precision of the
system, but it is always smaller than that of a LUT alone. NFGs perform the same
numerical calculations for every function (for example, $f(x) = c_1 x + c_0$ for linear
approximation), but just use different coefficients. NFGs can be considered a
combination of the methods described above. They use less memory than a LUT alone
and they often employ arithmetic devices (multipliers and adders) similar to power series
architectures. However, the computation by an NFG is not iterative. Thus, NFGs can
compute any function with a small amount of hardware and a small computation time.
2. Goal of This Thesis


This thesis analyzes NFG architectures in depth to make accurate estimations of
complexity and delay. These estimates make it easy to understand, for example, how tradeoffs
can be made between complexity, delay and accuracy. The only other way is to build
actual designs, which is computationally intensive. The thesis analyzes and compares arithmetic
component complexity and delay as well as NFG architectures that are composed of
those components. It develops models of common architectures, and provides a
framework with which any architecture can be built. Models for simple NFG
architectures are compared to determine which are the most efficient with respect to
hardware utilization and delay. Comparisons include hardware utilization and delay for
linear versus quadratic NFGs, as well as for NFGs with uniform versus non-uniform
segmentation.

B.  THESIS  ORGANIZATION 

Chapter  I  introduces  the  problem  being  discussed  in  this  thesis,  including  some  of 
the  current  methods  to  solve  the  problem.  It  also  discusses  why  this  thesis  focuses  on 
NFGs  instead  of  analyzing  the  other  methods.  Chapter  II  focuses  on  the  basic 
understanding  of  how  linear  and  quadratic  NFGs  work,  including  their  basic 
architectures.  Chapter  III  develops  accurate  tools  to  measure  hardware  utilization  and 
propagation  delay  for  the  basic  arithmetic  components  commonly  used  in  NFGs.  It 
explains  how  simulation  data  was  obtained  and  used  to  estimate  various  NFG 
configurations.  Chapter  IV  builds  models  for  NFG  architectures  commonly  used  in 
recent  resources.  Each  model  can  realize  any  function.  Chapter  IV  also  establishes  a 
framework  by  which  any  particular  NFG  architecture  can  be  built.  Chapter  V  compares 
the  models  in  Chapter  IV  for  example  functions.  It  shows  when  it  is  better  to  use 
quadratic  versus  linear  NFGs  for  several  functions  based  on  hardware  utilization  and 
delay.  It  also  develops  a  criterion  for  determining  whether  or  not  it  is  better  to  use  non- 
uniform  segmentation.  Chapter  VI  summarizes  the  findings  of  Chapter  V  and  discusses 
future  applications  of  the  modeling  methods  in  this  thesis. 
II. BACKGROUND ON NFGS


This chapter discusses how linear and quadratic NFGs operate. It is mostly
concerned with NFGs implemented on reconfigurable computers and FPGAs. The
architecture of an NFG is somewhat independent of the function being realized. Thus, a
generic NFG can be used to realize a wide range of functions without having to redesign
logic circuits. Also, NFGs on FPGAs are reconfigurable, so it is easy to reprogram them to
compute a different function.

A.  GENERAL  NFG  OPERATION 

An NFG is an arithmetic logic device that estimates the value of a real
function $f(x)$ for a given input x using a piecewise approximation $y(x)$. The domain of
the NFG, $[a,b]$, is divided into $s$ segments, each with domain $[x_{\min,i}, x_{\max,i})$, where $i$ is the
segment index number. Thus, $y(x) = y_i(x)$ iff $x_{\min,i} \le x < x_{\max,i}$. Each approximation
function $y_i(x)$ may be a linear, quadratic or some other simple function of x. For every
input x, the NFG must determine which segment x lies in, in order to select the
approximation function.

B.  LINEAR  NFGS 

Simple linear NFGs use the approximation function $y_i(x) = c_{1i}x + c_{0i}$ for each
segment, where $i \in \mathbb{Z}$ and $1 \le i \le s$. The values for $c_{1i}$ and $c_{0i}$ are stored in a coefficients
table and recalled once the segment number i is known for a particular x. Figure 1
shows an example of how linear approximation functions are used for each segment. In
the example, $f(x) = 2^x$ with a domain [0,5] and s = 5, and the particular segment index
i = 4.
Figure 1 Linear Approximation for a Single Segment for $f(x) = 2^x$.


1. Basic Linear NFG Architecture


The  architecture  of  a  basic  linear  NFG  is  shown  in  Figure  2.  It  consists  of 
arithmetic components (multiplier and adder), a memory to store coefficients, and a logic
circuit to determine the segment index (if necessary).

[Figure 2: block diagrams with input x[n-1:0] and output y[n-1:0]. Panels: (a) Uniform Segmentation, (b) Non-Uniform Segmentation]
Figure 2 Basic Linear NFG Architecture. (After [12])
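
The data flow of a basic linear NFG with uniform segmentation can be sketched in software as a table lookup followed by one multiplication and one addition. The Python sketch below is only a behavioral illustration under stated assumptions: it works in floating point rather than n-bit fixed point, it fills the table with secant-line coefficients for simplicity, and the names `build_table` and `linear_nfg` are hypothetical.

```python
def build_table(f, a, b, s):
    """Coefficient table: one (c1, c0) pair per uniform segment of [a, b]."""
    width = (b - a) / s
    table = []
    for i in range(s):
        lo, hi = a + i * width, a + (i + 1) * width
        c1 = (f(hi) - f(lo)) / (hi - lo)   # slope of the secant line
        c0 = f(lo) - c1 * lo               # intercept of the secant line
        table.append((c1, c0))
    return table


def linear_nfg(x, table, a, b):
    """Evaluate y(x) = c1*x + c0 using the coefficients of the segment holding x."""
    s = len(table)
    width = (b - a) / s
    i = min(int((x - a) / width), s - 1)   # segment index; x = b maps to the last segment
    c1, c0 = table[i]                      # coefficient memory read
    return c1 * x + c0                     # one multiplier, one adder


f = lambda x: 2.0 ** x
table = build_table(f, 0.0, 5.0, 5)
print(linear_nfg(3.5, table, 0.0, 5.0), f(3.5))
```

With uniform segments the index is obtained directly from the input (in hardware, from its most significant bits), which is why no separate segment index encoder is needed in panel (a).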


2.  Approximation  Techniques 

The linear equations $y_i(x)$ for each segment are computed prior to constructing the NFG,
and their coefficients are stored in the coefficients table. The coefficients can be determined by
several methods, a few of which are described below.
a. Secant Line Approximation (SLA)

For a given segment i, the endpoints of the segment ($x_{\min,i}$ and $x_{\max,i}$) are
used to determine the slope and intercept values ($c_{1i}$ and $c_{0i}$, respectively). The slope is
$c_{1i} = \frac{f(x_{\max,i}) - f(x_{\min,i})}{x_{\max,i} - x_{\min,i}}$ and the intercept value is $c_{0i} = f(x_{\min,i}) - c_{1i} x_{\min,i}$. The error of
this approximation is $\varepsilon_{SLA} = |f(x) - y_i(x)|_{\max}$.

b.  Modified  Secant  Line  Approximation  (MSLA) 

The SLA method is a quick method to estimate a function over a given
segment, but it is not the most accurate. The maximum error in a particular
segment can be reduced by adjusting $c_{0i}$ by a value less than $\varepsilon_{SLA}$. Consider a function
$f(x)$ that is monotone increasing or decreasing over $[x_{\min,i}, x_{\max,i}]$ and for which the linear
approximation $y_i(x) = c_{1i}x + c_{0i} \ne f(x)$ on $(x_{\min,i}, x_{\max,i})$, as is the case when $f(x)$ is
strictly convex or concave on the segment. Then $y_i(x)$ is always greater than or always
less than $f(x)$ on $(x_{\min,i}, x_{\max,i})$. If $y_i(x) > f(x)$ on $(x_{\min,i}, x_{\max,i})$, then subtracting
$\varepsilon_{SLA}/2$ from $c_{0i}$ (from the SLA) yields a maximum error of $\varepsilon_{MSLA} = \varepsilon_{SLA}/2$ for the
segment. Figure 3 shows the difference between the linear approximation equations
using SLA and MSLA.
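
The SLA and MSLA coefficient calculations described above can be sketched as follows for the segment [3, 4] of $f(x) = 2^x$. This is an illustrative sketch, not the thesis's tooling: the maximum error is estimated on a dense sample grid rather than derived analytically, and the helper names `sla`, `msla` and `max_error` are assumptions.

```python
def sla(f, x_min, x_max):
    """Secant line through the segment endpoints; returns (c1, c0)."""
    c1 = (f(x_max) - f(x_min)) / (x_max - x_min)
    c0 = f(x_min) - c1 * x_min
    return c1, c0


def max_error(f, c1, c0, x_min, x_max, samples=10001):
    """Largest |f(x) - (c1*x + c0)| on a dense grid (a numerical estimate)."""
    step = (x_max - x_min) / (samples - 1)
    worst = 0.0
    for k in range(samples):
        x = x_min + k * step
        worst = max(worst, abs(f(x) - (c1 * x + c0)))
    return worst


def msla(f, x_min, x_max):
    """Shift the secant line by half of its maximum error toward f."""
    c1, c0 = sla(f, x_min, x_max)
    eps_sla = max_error(f, c1, c0, x_min, x_max)
    mid = 0.5 * (x_min + x_max)
    # +1 if the secant lies above f (convex f), -1 if it lies below (concave f).
    sign = 1.0 if c1 * mid + c0 > f(mid) else -1.0
    return c1, c0 - sign * eps_sla / 2.0


f = lambda x: 2.0 ** x
c1, c0 = sla(f, 3.0, 4.0)
eps_sla = max_error(f, c1, c0, 3.0, 4.0)
c1_m, c0_m = msla(f, 3.0, 4.0)
eps_msla = max_error(f, c1_m, c0_m, 3.0, 4.0)
print(c1, c0, eps_sla, eps_msla)
```

Only the intercept changes between the two methods, so MSLA costs nothing extra in hardware while halving the worst-case error on the segment.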

c.  Least  Squares  Approximations 

MATLAB provides a function called polyfit to calculate coefficients for linear,
quadratic and higher order approximation functions based on the least-squares error. The
least squares method is commonly used to minimize the sum of the squared differences between
two given functions. This particular method is not desired for applications with NFGs.
NFGs are concerned with being able to compute a value of a function and yield an
answer that is correct to the limits of the number system on which it is implemented.
NFGs are designed to produce a result with an error that is less than a maximum specified
error, and not to minimize the sum or average errors. The example in Figure 3 shows that
the polyfit function (using a linear fit) produces a larger maximum error than the MSLA.

[Figure 3 plot title: Various Linear Approximation Methods]
Figure 3 Linear Approximations of $f(x) = 2^x$.
C.  QUADRATIC  NFGS 


Quadratic NFGs use the approximation function $y_i(x) = c_{2i}x^2 + c_{1i}x + c_{0i}$ for each
segment, where $i \in \mathbb{Z}$ and $1 \le i \le s$. The values for $c_{2i}$, $c_{1i}$ and $c_{0i}$ are stored in a
coefficients table and recalled once the segment number is known for a particular x.
Figure 4 shows an example of how quadratic approximation functions are used for each
segment. The example is the same as discussed for a linear approximation above.
Figure 4 Quadratic Approximation for a Single Segment for $f(x) = 2^x$.


1.  Basic  Quadratic  NFG  Architecture 

The  architecture  of  a  basic  quadratic  NFG  is  shown  in  Figure  5.  Like  the  linear 
architecture,  it  also  consists  of  arithmetic  components  (multipliers  and  adders),  a  memory 
to store coefficients, and a logic circuit to determine the segment index. However,
quadratic  NFGs  require  three  multipliers  and  a  3-input  adder.  Although  quadratic  NFGs 
require  more  arithmetic  devices  than  linear  NFGs,  they  require  fewer  segments,  and  thus 
smaller  memory  sizes. 


[Figure 5: block diagrams with input x[n-1:0] and output y[n-1:0]. Panels: (a) Uniform Segmentation, (b) Non-Uniform Segmentation]
Figure 5 Basic Quadratic NFG Architectures. (After [8]).
2. Approximation Techniques


Determining  the  best  coefficients  for  quadratic  approximations  is  quite  difficult 
and  cannot  be  generalized  for  all  functions.  However,  some  methods  have  been 
considered  sufficient  to  find  coefficients  that  can  accurately  approximate  given  functions. 
Several  approximation  techniques  are  outlined  in  [6],  but  the  ones  of  concern  are  those 
that  minimize  the  maximum  error  in  each  segment.  These  are  known  as  the  least 
maximum polynomial approximations [6].

a.  2nd  Order  Chebyshev  Polynomial  Approximation 

Chebyshev polynomials provide a straightforward method for determining
the  coefficients  required  to  approximate  a  function  with  any  order  polynomial. 
“Chebyshev  polynomials  play  a  central  role  in  approximation  theory  [6].”  They  have 
been  studied  in  depth  and  have  many  properties  that  allow  simple  error  calculations. 
Their  properties  are  used  to  prove  asymptotic  relations  for  finding  the  widest  segment 
required  and  for  finding  the  minimum  number  of  segments  required. 
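
As a concrete illustration of a second-order Chebyshev-based fit, the sketch below interpolates $f(x) = 2^x$ on the segment [3, 4] at the three Chebyshev nodes of that segment and expands the result into the coefficients $c_2$, $c_1$, $c_0$. Interpolation at Chebyshev nodes is one common variant and is an assumption here; the thesis may use a different Chebyshev construction. The function names are hypothetical.

```python
import math


def chebyshev_quadratic(f, lo, hi):
    """Return (c2, c1, c0) of the quadratic interpolating f at the three
    Chebyshev nodes of [lo, hi]."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    # Nodes cos((2k+1)*pi/6), k = 0, 1, 2, mapped from [-1, 1] to [lo, hi].
    x0, x1, x2 = [mid + half * math.cos((2 * k + 1) * math.pi / 6) for k in range(3)]
    y0, y1, y2 = f(x0), f(x1), f(x2)
    # Newton divided differences.
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    d012 = (d12 - d01) / (x2 - x0)
    # Expand y0 + d01*(x - x0) + d012*(x - x0)*(x - x1) into powers of x.
    c2 = d012
    c1 = d01 - d012 * (x0 + x1)
    c0 = y0 - d01 * x0 + d012 * x0 * x1
    return c2, c1, c0


def max_error(f, coeffs, lo, hi, samples=2001):
    """Largest |f(x) - (c2*x^2 + c1*x + c0)| on a dense grid."""
    c2, c1, c0 = coeffs
    worst = 0.0
    for k in range(samples):
        x = lo + (hi - lo) * k / (samples - 1)
        worst = max(worst, abs(f(x) - ((c2 * x + c1) * x + c0)))
    return worst


f = lambda x: 2.0 ** x
coeffs = chebyshev_quadratic(f, 3.0, 4.0)
err = max_error(f, coeffs, 3.0, 4.0)
print(coeffs, err)
```

For this kind of interpolation the error on a segment of width $w$ is bounded by $\max|f^{(3)}(x)| \, w^3 / 192$, which shows the dependence on the third derivative that drives the segment-width estimates for quadratic NFGs.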

b.  Minimax  Approximation 

Second order minimax approximations use the fact that there are at least
four values of x where the maximum approximation error is reached with alternating
signs, namely $x_0$, $x_1$, $x_2$, and $x_3$ [6]. The minimax approximation solves the following set
of equations to determine the coefficients of the polynomial approximation
$y_i(x) = c_{2i}x^2 + c_{1i}x + c_{0i}$:

$$
\begin{aligned}
y(x_0) - f(x_0) &= \varepsilon \\
y(x_1) - f(x_1) &= -\varepsilon \\
y(x_2) - f(x_2) &= \varepsilon \\
y(x_3) - f(x_3) &= -\varepsilon \\
\frac{dy(x_1)}{dx} - \frac{df(x_1)}{dx} &= 0 \\
\frac{dy(x_2)}{dx} - \frac{df(x_2)}{dx} &= 0
\end{aligned}
$$
c. Remez Algorithm

The  Remez  algorithm  for  finding  polynomial  coefficients  is  an  iterative 
method  that  starts  with  coefficient  value  estimates  from  typically  either  Chebyshev  or 
minimax  approximations.  The  points  where  the  error  is  maximum  are  found  and  then 
used  to  calculate  new  coefficients,  reducing  the  new  error.  Since  Chebyshev  polynomials 
have  approximations  that  are  very  close  to  optimum,  the  Remez  algorithm  quickly 
converges. This method often provides coefficients that approximate the function more
accurately, which allows larger segment sizes and therefore fewer required segments.

D.  FACTORS  CONTRIBUTING  TO  COMPLEXITY  AND  DELAY 

The complexity and delay of an NFG depend on the complexity and delay of its
arithmetic  components,  as  well  as  the  size  of  the  coefficient  table  required. 

1.  Factors  Affecting  Arithmetic  Component  Complexity  and  Delay 

a.  The  Size  of  the  NFG 

The size of the NFG, n, refers to the number of bits input into the NFG.
The  examples  analyzed  in  this  thesis  also  assume  that  the  NFG  produces  the  same 
number  of  bits  for  its  result.  As  n  grows,  the  complexity  and  delay  grow  because  more 
logic  gates  are  required  for  each  of  the  components  in  the  NFG.  For  example,  a  32-bit 
adder  requires  more  logic  gates  and  has  a  longer  delay  than  a  16-bit  adder. 

b.  NFG  Architecture 

NFGs can be configured in several ways. The architecture determines
what components and how many components are needed to realize a function $f(x)$. For
example, a basic linear NFG with uniform segmentation requires a multiplier, adder, and
coefficients table. An equivalent basic quadratic NFG with uniform segmentation
requires three multipliers and two adders. Other configurations can require other
arrangements and numbers of components, which all contribute to the total complexity and
delay. Some NFGs can be arranged to compute several operations in parallel to minimize
overall delay. Thus, the architecture plays a large role in the complexity and delay of the
NFG components.

2.  Factors  Affecting  the  Number  of  Segments 

The number of segments depends on the size of the NFG, n, $f(x)$ and its domain
$[a,b]$, and the segmentation method. The number of segments determines how much
memory is required to store the coefficients for the estimation equation $y(x)$. These factors are
analyzed further in later chapters.

a.  Function  and  NFG  Domain 

Asymptotic equations in [5] show that the minimum segment width
required is a function of the 2nd or 3rd derivative of $f(x)$ for linear and quadratic NFGs
respectively. Thus, for a given NFG domain, the number of segments required also
depends on the particular function $f(x)$ realized by the NFG. As the domain of the NFG
gets larger, more segments are required for the same allowable error $\varepsilon$.

b.  The  Size  of  the  NFG 

The  number  system,  or  the  number  of  bits  in  the  input  and  output  of  an 
NFG,  plays  a  role  in  determining  the  maximum  allowable  error.  The  goal  of  an  NFG  is 
to  compute  an  approximation  with  an  error  that  won’t  be  noticed  by  the  system  that  is 
using the NFG. As n grows, the allowable error $\varepsilon$ gets smaller, requiring more segments.
Also,  the  size  of  the  NFG  generally  affects  the  required  precision  for  the  NFG,  which 
affects  the  number  of  required  segments.  Therefore,  the  size  of  the  coefficient  table  also 
depends  on  n. 


c.  Segmentation  Method 

Choosing between uniform and non-uniform segmentation can drastically
affect the overall number of segments required. Methods in [5] derive a minimum
segment width, $\sigma_{\min}$, for a given function on a given interval $[a,b]$. When the interval is
divided into uniform-width segments, each segment has width $\sigma_i = \sigma_{\min}$ for all $i$, where
$1 \le i \le s_{\min}$. Here, $s_{\min}$ is the minimum number of segments required and
$s_{\min} = \left\lceil \frac{b-a}{\sigma_{\min}} \right\rceil$. Non-uniform segmentation
over the same interval first finds $\sigma_{\min}$ and uses it for a particular segment, $\sigma_i$. For
optimum segmentation, a new $\sigma_{\min}$ is found for the remaining portion of the interval
(excluding segment $i$). This occurs repeatedly until the segments include the entire
domain of the NFG. Non-uniform segmentation always produces fewer segments.
Figure 6 shows an example to compare the number of segments required for uniform and
non-uniform segmentation of $f(x) = \cos \pi x$ on $[0, 0.5]$ for $\varepsilon = 2^{-9}$. Uniform
segmentation requires 11 segments, and non-uniform segmentation requires 10.
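
The two segmentation strategies can be sketched numerically for $f(x) = \cos \pi x$ on [0, 0.5] with $\varepsilon = 2^{-9}$. The sketch below makes several assumptions that differ from the thesis's procedure: it uses the MSLA error estimated on a sample grid as the per-segment error, it ignores coefficient quantization, and it finds each non-uniform segment greedily by bisection. The counts it prints therefore depend on these assumptions and are not expected to reproduce the values quoted for Figure 6.

```python
import math


def msla_error(f, lo, hi, samples=200):
    """MSLA error estimate: half the largest gap between f and its secant on [lo, hi]."""
    if hi <= lo:
        return 0.0
    c1 = (f(hi) - f(lo)) / (hi - lo)
    c0 = f(lo) - c1 * lo
    gap = 0.0
    for k in range(samples + 1):
        x = lo + (hi - lo) * k / samples
        gap = max(gap, abs(f(x) - (c1 * x + c0)))
    return gap / 2.0


def uniform_segments(f, a, b, eps):
    """Smallest number of equal-width segments that all meet the error bound."""
    s = 1
    while True:
        w = (b - a) / s
        if all(msla_error(f, a + i * w, a + (i + 1) * w) <= eps for i in range(s)):
            return s
        s += 1


def nonuniform_segments(f, a, b, eps):
    """Greedy segmentation: make each segment as wide as the error bound allows."""
    count, left = 0, a
    while left < b:
        if msla_error(f, left, b) <= eps:
            right = b                      # the rest of the domain fits in one segment
        else:
            lo, hi = left, b               # bisect for the widest acceptable segment
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if msla_error(f, left, mid) <= eps:
                    lo = mid
                else:
                    hi = mid
            right = lo
        count += 1
        left = right
    return count


f = lambda x: math.cos(math.pi * x)
n_uniform = uniform_segments(f, 0.0, 0.5, 2.0 ** -9)
n_nonuniform = nonuniform_segments(f, 0.0, 0.5, 2.0 ** -9)
print(n_uniform, n_nonuniform)
```

The uniform width is dictated by the most sharply curved part of the function, while the greedy procedure widens segments wherever the curvature is small, which is the source of the segment reduction.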


[Figure 6 plots of f(x) = cos(pi*x) versus x on [0, 0.5]: (a) Uniform Segmentation, No. of segments = 11; (b) Non-Uniform Segmentation, No. of segments = 10]

Figure 6 Uniform vs. Non-Uniform Segmentation. (From [20])


E.  CHAPTER  SUMMARY 

This  chapter  shows  how  NFGs  approximate  real  functions,  including  several 
methods for computing the coefficients of the approximation equations. It also shows
factors that affect the complexity and delay of NFGs and the components required to
construct four basic NFG architectures. The next chapter shows how each of these
components (and others) can be built on the Xilinx Virtex-II. It estimates the complexity
and delay based on the size of each component using simulation data and approximated
data.
III. ANALYZING HARDWARE COMPLEXITIES AND
PROPAGATION DELAYS


This  chapter  proposes  a  method  to  estimate  circuit  complexity  and  speed  for 
common  NFG  components.  This  will  allow  us  to  compare  the  hardware  complexity  and 
speed  of  various  NFG  configurations.  A  standard  method  for  measuring  these  quantities 
is  proposed.  The  proposed  method  is  applicable  to  a  wide  range  of  configurations, 
providing  meaningful  comparisons  among  various  NFG  configurations. 

The  supporting  data  was  observed  using  particular  hardware  (Xilinx  Virtex-II) 
and  software  (Xilinx  ISE  Project  Navigator),  but  the  methods  can  be  applied  universally 
to  other  FPGAs  with  minor  alterations.  Since  the  method  of  measuring  is  standardized,  it 
provides  a  meaningful  approach  in  understanding  the  relative  complexity  of  realizing 
different  arithmetic  functions. 

When  actually  designing  an  arithmetic  logic  device,  pipelining  can  dramatically 
reduce  propagation  delays  for  the  circuit.  In  best  case  scenarios,  pipelining  can  cause  the 
circuit  to  output  an  answer  every  clock  period.  A  disadvantage  of  pipelining  comes  from 
an  initial  delay  due  to  the  pipeline  depth.  Large  circuits  tend  to  have  a  large  pipeline 
depth,  which  means  there  is  a  long  delay  from  the  time  data  is  input  into  the  circuit,  until 
the result comes out. Because pipelining can be implemented at various points in a
logic  circuit,  it  is  difficult  to  reach  a  standard  way  to  measure  time  delay.  For  this  reason, 
this  thesis  implements  combinational  logic  circuits  instead  of  pipelined  circuits.  In 
general,  a  combinational  logic  circuit  that  has  a  longer  propagation  delay  will  tend  to 
have  a  longer  pipeline  depth  as  well.  Thus,  it  is  a  relevant  method  of  delay  measurement. 

A.  HARDWARE  RESOURCES 

NFG component circuit designs are simulated and synthesized for the Xilinx
Virtex-II XC2V6000 FPGA with a speed grade of -4. This is the FPGA that is presently
available on the SRC-6, a reconfigurable computer at NPS. This section explains the
general architecture of the Xilinx Virtex-II FPGA, including the available logic
resources. The Virtex-II includes Combinational Logic Blocks (CLBs), 18-by-18-bit
signed multipliers (MULT18x18s), and Block Select RAM (BRAM). Figure 7 shows
how these resources are arranged on the Virtex-II FPGA.


[Figure 7 image: Virtex-II architecture overview showing the DCM, IOB, CLB, Block SelectRAM, and Multiplier regions.]

Figure  7  General  Placement  of  Resources  on  Xilinx  Virtex-II  FPGA.  (From  [18]) 


Also  shown  are  the  Digital  Clock  Manager  (DCM)  units  and  Input/Output  Blocks 
(IOBs),  which  are  not  used  in  the  complexity  measure.  DCMs  can  be  used  to  de-skew 
clock  signals,  manage  multiple  clock  phases,  create  multiple  frequency  clock  signals,  and 
more  [19].  The  analyses  in  this  thesis  consider  combinational  logic  delays  and  do  not 
take  into  account  complicated  clocking  schemes.  Therefore,  DCM  usage  is  not 
considered  in  this  thesis.  IOBs  route  signals  from  the  input  pins  to  the  logic  circuitry  in 
the  FPGA  and  route  signals  from  the  logic  circuitry  to  the  output  pins.  The  NFGs 
considered  in  this  thesis  are  built  from  the  available  logic  within  the  Virtex-II 
XC2V6000. Thus, for a given NFG size n, the number of IOBs consumed is 2n. In this
thesis,  all  of  the  available  logic  resources  are  always  consumed  before  the  IOBs. 
Therefore,  the  number  of  IOBs  consumed  is  not  relevant  when  comparing  NFGs  of  the 
same  size. 

Each CLB on the Virtex-II FPGA is subdivided into four slices. Each slice is
identical, except for its position in the CLB. Thus, the number of available slices is also a
good measure of logic resources. Table 1 shows the five resources and the quantity
available on the Xilinx Virtex-II XC2V6000 FPGA. The amount of resources available,
timing information for specific logic devices, and other specifications are included in the
author’s MATLAB file LoadISEDeviceData. It also imports some data from
simulations. NFGs implemented on other FPGAs can be analyzed by altering
LoadISEDeviceData to contain specifications for that particular FPGA.


Resource      Quantity
Slices        33792
MULT18x18     144
BRAM          144
IOB           1104
DCM           12

Table  1  Xilinx  Virtex-II  XC2V6000  Resources.  (From  [18]) 

1.  CLBs 

The most basic element of the CLB is the function generator. The function
generator can be configured to realize a 4-input 1-output logic function or LUT, a
ROM/RAM with 16 1-bit words (16x1), or a 16-bit shift register. Even though 16x1
RAM units are realizable with a LUT, the circuits analyzed in this thesis do not require
RAM. Therefore, there is no further discussion of components that are related to
RAM. For the purpose of this thesis, the function generator can be considered a look-up
table independent of what purpose it serves. For example, a 16x1 ROM is a 4-input to
single-output function. Xilinx has configured quick paths for linking these devices into
larger configurations based on what purpose they serve. These timing characteristics are
taken into account when building and analyzing the specific components on the FPGA.
When considering how much hardware is used, the specific function of the function
generator is irrelevant. The circuit designs in this thesis most often use the function
generator as a LUT. Therefore, in order to simplify terminology, each function generator
is referred to as a LUT. Figure 8 illustrates a portion of the basic slice of a Virtex-II
FPGA, highlighting some of the logic devices that are used in this thesis.


Figure  8  One-Half  of  a  Xilinx  Virtex-II  Slice.  (After  [18]) 


A slice combines two LUTs with additional hardware including several MUXs,
two clocked registers, and additional gates that are commonly used in arithmetic
operations (XORCY, ORCY, etc.). Thus, Xilinx has made the basic slice extremely
versatile and efficient for common operations. There are four slices per CLB. These are
connected together efficiently with minimal signal propagation delay. See Figure 9 for
an illustration of the CLB layout.

Figure 9 Xilinx XC2V6000 CLB Layout. (From [18])

2.  MULT18x18s

The MULT18x18 is a signed two’s complement multiplier. Thus, it can multiply
two 17-bit magnitude numbers and return a 35-bit magnitude result along with an extra
bit for the sign. The MULT18x18s are arranged in columns as shown in Figure 7. This
reduces the propagation delay between the MULT18x18 and its surrounding components,
allowing for fast connections from the MULT18x18 to BRAMs, CLBs, or IOBs. The
MULT18x18s cannot be configured to perform other functions, but they may be used as
multipliers with multiplicands of fewer than 18 bits. There are a few benefits to using
one for smaller multipliers. First, the circuit designer does not need to design a multiplier
from CLBs (which would be slow). Second, because the multiplier does not consume CLBs,
the CLBs can be used for other functions. This consumes all of the resources more
evenly. Finally, when considering circuit performance, using multiplicands with fewer
bits results in fewer bits in the product. This results in a smaller propagation delay
through the MULT18x18. Xilinx has designed the Virtex-II such that the delay from the
input to the output is linear with respect to the output pin. For example, if the MSB of
the product comes off of pin $a$ and it takes $t_a$ to propagate through the MULT18x18,
then a multiplier with its MSB off of pin $a + k$ takes $t_{a+k} = t_a + k\delta$, where
$a, k \in \mathbb{Z}$, $0 \le a + k \le 35$, and $\delta$ is the slope of the line in
Figure 10. The Multiplier Switching Section in [18] shows the delay at each pin, from the
LSB of the multiplicand to the MSB of the product. The synthesis reports show this linear
relation (Appendix B.1).

[Figure 10 image: pin-to-delay ratio curve, a straight line running from Pin 0 to Pin 35 plotted against delay.]

Figure 10 Pin-to-Delay ratio curve for MULT18x18. (From [19])
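
As a minimal sketch of the linear pin-to-delay relation described above, the following Python function evaluates $t_{a+k} = t_a + k\delta$. The function name and the numeric values in the usage line are hypothetical placeholders chosen for illustration; they are not data from [18] or from the synthesis reports.

```python
def mult_pin_delay(t_a, a, k, delta):
    """Estimate the delay (ns) to output pin a + k of a MULT18x18.

    t_a   -- known delay (ns) to output pin a
    a     -- reference output pin
    k     -- pin offset (may be negative)
    delta -- delay increment per pin (ns), i.e. the slope of the linear relation
    """
    pin = a + k
    if not 0 <= pin <= 35:
        raise ValueError("output pin must lie in the range 0..35")
    return t_a + k * delta


# Placeholder values only: 3.0 ns at pin 20 and 0.1 ns per pin are NOT measured data.
estimate = mult_pin_delay(t_a=3.0, a=20, k=5, delta=0.1)
print(f"estimated delay at pin 25: {estimate:.3f} ns")
```

With a measured $t_a$ and $\delta$ substituted for the placeholders, the same call gives the delay of any narrower or wider product without re-synthesizing the multiplier.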

3.  BRAMs

BRAMs are an integral resource on the Virtex-II. They are arranged in columns
between the MULT18x18s and the CLBs. This reduces the delay between memory and
the multipliers. Each of the 6 columns contains 24 BRAMs. Each BRAM contains up to
18 Kbits, and can be configured in various word widths (1 to 36 bits). Thus, each BRAM
uses 9 to 14 address lines, depending on the width of the word stored. There are a total of
324 Kbytes of data storage in BRAMs on the Virtex-II XC2V6000.
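
The BRAM arithmetic above can be checked with a short Python sketch. The width-to-depth table below is an assumption based on the commonly documented Virtex-II block RAM aspect ratios (it is not listed in this chapter); the BRAM count and the 18 Kbit size come from the text and Table 1.

```python
BRAM_KBITS = 18   # capacity of one BRAM in Kbits, as stated in the text
NUM_BRAMS = 144   # 6 columns x 24 BRAMs (Table 1)

# Assumed aspect ratios: word width (bits) -> depth (words).
# Widths 9, 18 and 36 include parity bits in this assumed table.
BRAM_CONFIGS = {1: 16384, 2: 8192, 4: 4096, 9: 2048, 18: 1024, 36: 512}


def address_lines(width):
    """Number of address lines needed for the given word width."""
    depth = BRAM_CONFIGS[width]
    return depth.bit_length() - 1  # exact log2 for a power-of-two depth


def total_kbytes():
    """Total BRAM storage on the device in Kbytes."""
    return NUM_BRAMS * BRAM_KBITS / 8


for width in sorted(BRAM_CONFIGS):
    print(f"width {width:2d} bits -> {address_lines(width)} address lines")
print(f"total storage: {total_kbytes():.0f} Kbytes")
```

Under these assumptions the widest configuration needs 9 address lines and the narrowest needs 14, and the device total is 144 x 18 / 8 Kbytes, which agrees with the figures quoted in the text.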


B.  SOFTWARE 

This  section  discusses  the  software  that  was  used  to  obtain  simulation  data  and  to 
estimate  complexity  and  delay  for  NFG  components. 

1.  Xilinx  ISE  Project  Navigator 

Xilinx  ISE  Project  Navigator  was  used  extensively  for  designing,  simulating  and 
synthesizing  various  arithmetic  logic  devices.  The  software  suite  includes  schematic  and 
VHDL  editors  along  with  a  library  of  hardware  primitive  components.  In  some  cases, 
behavioral  VHDL  modules  were  created,  and,  in  other  cases,  schematic  modules  were 
created. Once a particular module was created, it was synthesized to provide estimations
of hardware utilization and worst-case propagation delays. Examples of the synthesis
reports are contained in Appendix B.1.

2.  MATLAB 

MATLAB  was  also  used  extensively.  MATLAB  was  used  to  plot  data  obtained 
from  the  synthesis  reports.  It  was  also  used  to  import  the  same  data  and  to  estimate 
hardware  utilization  and  delay  for  various  arithmetic  devices.  It  was  used  for  visual 
analysis  of  NFG  hardware  utilization  and  propagation  delays.  A  summary  of  the 
MATLAB  source  code  is  in  Appendix  A. 

C.  DATA  COLLECTION  AND  ESTIMATION 

In  order  to  analyze  a  particular  NFG’s  hardware  utilization  and  propagation  delay, 
it  is  necessary  to  have  data  on  the  particular  arithmetic  components  that  are  used  by  the 
NFG.  For  example,  if  an  NFG  requires  a  23x23-bit  multiplier  and  a  46-bit  adder,  then  it 
is  necessary  to  know  the  hardware  utilization  and  propagation  delay  for  the  23x23-bit 
multiplier  and  the  46-bit  adder.  The  goal  of  collecting  the  data  for  this  thesis  is  to  obtain 
relatively  accurate  measurements  in  order  to  be  able  to  estimate  complexity  and  delay 
parameters  without  having  to  implement  a  specific  logic  design  of  each  NFG.  In 
addition,  it  might  be  required  that  we  compare  this  same  NFG  to  a  similar  one  with  a 
22x22-bit  multiplier  and  a  44-bit  adder.  Since  it  is  impractical  to  construct  multipliers, 
adders  (and  other  arithmetic  devices)  of  every  possible  size,  only  a  subset  of  sizes  were 
considered.  The  pertinent  information  was  gathered  from  the  synthesis  reports  into  the 
text files in Appendix B.2. Timing data from the synthesis reports was used because it
was accurate to 1 ps. Timing information provided in [18] was only accurate to 10 ps, but
still confirmed the data obtained through simulation. Since the data did not cover all
possible sizes, estimates were made so that a data point exists for components of all sizes
ranging from 1-bit components up to 129-bit components. In some cases, such as the
Ripple Carry Adder (RCA), equations were developed that match all of the simulation
data points. In other cases, such as the multiplier, missing data points were estimated
using linear approximations. Device architectures and trend analysis of the data points
were both considered when deciding what data points to collect.

1.  Making  Linear  Approximations  for  Missing  Data  Points 

The author’s MATLAB function fillLin takes scattered x and y data points, given
in array form, and estimates the data points in between the given x values. The array x
must be an array of monotonically increasing integers. The length of the array x must be
the same as that of the array y. This is applicable to this thesis because the function
estimates a parameter of an n-bit device, where $n \in \mathbb{N}$. The array x holds the
n values in the collected data tables, and the array y holds the propagation delay values
or the hardware utilization values. The fillLin function produces an array y’ whose index
ranges from 1 to the maximum value of the original x array, and whose value at each index
is the estimated function value evaluated at that index. For example, to approximate a
known function $f(x) = x^2$, where data points are taken at x = 1, 2, 4, 7, and 9, call the
function in MATLAB with the array x = [1 2 4 7 9] and the corresponding array
y = [1 4 16 49 81]. The function fillLin returns the array
y’ = [1 4 10 16 27 38 49 65 81]. The array y’ is now 9 elements long, and has a value for
every integer x ranging from 1 to 9. To obtain an estimate of y(3), or $3^2$, simply index
y’ with 3, resulting in y’(3) = 10. Of course, this example illustrates the inaccuracies of
the approximation (the true value is 9), but as more data points are collected, better
approximations occur. Also, this function is applied only to monotonically increasing
functions, namely hardware utilization with respect to word size, and propagation delay
with respect to word size. As the word size of an arithmetic device gets larger, both
complexity and delay get larger. Even slightly inaccurate estimations still provide a
value that can be used for general comparisons. Figure 11 shows how fillLin fills in the
missing data from collected data points with linear approximations to form a function
defined at every integer input from 1 to at most 129.

Figure 11 Example of fillLin Approximation for $y = x^2$.

The graph in Figure 12 shows the application of the function fillLin to the data
collected  for  the  net  delays.  The  stems  represent  the  actual  data  points  collected.  This 
means  that  propagation  delays  were  collected  for  several  fanouts.  If  a  designer  needs  to 
find  the  net  delay  for  a  particular  node  with  a  fanout  of  100,  it  is  easy  to  extract  that 
information  from  the  array  created  by  fillLin.  Data  collected  for  several  components  is 
shown  in  Appendix  B.2.  The  graphs  of  fillLin  results  for  these  data  points  are  shown  in 
Appendix  B.3. 
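
The interpolation that fillLin is described as performing can be sketched in Python as follows. This is not the author’s MATLAB source: the name fill_lin, the error handling, and the choice to hold the first sample value for indices below the first x value are assumptions made for this sketch.

```python
def fill_lin(x, y):
    """Fill in integer positions 1..max(x) by piecewise linear interpolation.

    x -- strictly increasing positive integers (sampled sizes)
    y -- sampled values, same length as x
    Returns a list whose element i-1 is the estimate at integer position i.
    """
    if not x or len(x) != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    if any(b <= a for a, b in zip(x, x[1:])):
        raise ValueError("x must be strictly increasing")

    filled = []
    seg = 0  # index of the left end of the current interpolation segment
    for n in range(1, x[-1] + 1):
        if n <= x[0]:
            # Assumption: below the first sample, hold the first sampled value.
            filled.append(y[0])
            continue
        while x[seg + 1] < n:
            seg += 1
        x0, x1 = x[seg], x[seg + 1]
        y0, y1 = y[seg], y[seg + 1]
        filled.append(y0 + (y1 - y0) * (n - x0) / (x1 - x0))
    return filled


# The x^2 example from the text: samples at 1, 2, 4, 7 and 9.
estimates = fill_lin([1, 2, 4, 7, 9], [1, 4, 16, 49, 81])
print(estimates)
print("estimate at x = 3:", estimates[3 - 1])
```

Because Python lists are zero-based, the estimate for size n is stored at index n - 1, whereas the MATLAB array described in the text is indexed directly by n.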


Figure 12 fillLin Function (Using Data Points from Net Delay).

The fillLin function yields an accurate representation without having to collect
data points to fill the entire x-axis. The accuracy of the fillLin function is not analyzed in
depth in this thesis because the errors in estimation are relatively minute. For example, a
visual inspection of Figure 12 shows that the largest jump between data points
occurs between fanout values of 81 and 127. The approximate difference in net delay
between the two fanouts is 0.1 ns. Assuming basic knowledge of net delay vs. fanout, we
can say that net delay is monotone increasing between successive data points. Therefore,
the maximum possible error in the estimated net delay is 0.1 ns, which is relatively
minute. The actual error is most likely much smaller than 0.1 ns. However, when fewer
data points are collected, the relative errors can be large. To minimize these errors,
specific data points are collected based on analysis of component architectures.
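
The error argument above relies on monotonicity: between two successive samples, both the true value and the linear estimate lie between the two sampled values, so the error cannot exceed their difference. A minimal Python sketch of that bound follows; the function name is hypothetical, and the data reuses the $x^2$ example from this chapter rather than measured net delays.

```python
def max_interp_error_bound(y):
    """Upper bound on linear-interpolation error for monotone increasing data.

    y -- sampled values in order of increasing x
    Returns the largest difference between successive samples.
    """
    if any(b < a for a, b in zip(y, y[1:])):
        raise ValueError("samples must be monotone increasing")
    return max((b - a for a, b in zip(y, y[1:])), default=0)


# Samples of x^2 at x = 1, 2, 4, 7, 9 (the fillLin example in the text).
samples = [1, 4, 16, 49, 81]
print("error bound:", max_interp_error_bound(samples))
```

The bound is pessimistic: in the $x^2$ example the interpolated value at x = 3 is 10 against a true value of 9, an error of 1, which is far below the bound.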

When collecting data to enter into the function, data points were collected at key
positions so that a piecewise linear approximation of the complexity and delay equations
would be accurate. It was verified that midpoints corresponded to projected linear
approximations. The purpose of the steps above is to develop a function that returns the
delay or complexity of a circuit element based on the number of input bits and the type
of element. For example, if an NFG requires a 23x23-bit multiplier, the function returns
an accurate time delay without building and synthesizing it; its complexity and delay are
computed by interpolating between a value of n above and a value of n below n = 23.

In some cases, it was possible to determine an actual function from the data
points. For example, the delay of an RCA versus word size is a linear function for n > 4.
For these instances, the linear equation is used to approximate time delays and/or
complexity, and ‘if’ statements replace the delay value for data points that do not fit the
approximation equation. In the case of the RCA, for n = 1 to 4, a simpler architecture is
possible, so specific data points are used to give the delay and size estimates. In general,
all devices exhibit nonlinear behavior of delay and size versus n when n is small because
there are multiple ways to route signals inside each slice of the FPGA. Each of these
signal paths has a different delay based on the particular electronic device through which
it is routed.

Data was collected at various word widths n for net delays, which are based on an
n-bit fanout, n×n-bit unsigned multipliers, n-bit RCAs, n:1 MUXs, n-address-bit
distributed RAM/n-input functions/n-address-bit ROM, and BRAMs. Other devices can
be constructed from these basic elements.
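
The RCA treatment described above (tabulated values for small n, a fitted line elsewhere) can be sketched in Python as follows. The function name, the small-n table, and the slope and intercept are hypothetical placeholders for illustration; none of them are values from the synthesis reports.

```python
def piecewise_delay(n, small_n_delays, slope, intercept):
    """Delay estimate (ns) for an n-bit device.

    small_n_delays -- dict of explicit data points for sizes that do not
                      follow the linear trend (for example n = 1..4)
    slope, intercept -- coefficients of the linear fit used for all other n
    """
    if n < 1:
        raise ValueError("word size must be at least 1 bit")
    if n in small_n_delays:
        # Plays the role of the 'if' statements that override the linear fit.
        return small_n_delays[n]
    return slope * n + intercept


# Placeholder numbers only; they are NOT measured RCA delays.
PLACEHOLDER_SMALL_N = {1: 1.0, 2: 1.2, 3: 1.3, 4: 1.4}
print(piecewise_delay(3, PLACEHOLDER_SMALL_N, slope=0.05, intercept=1.5))
print(piecewise_delay(46, PLACEHOLDER_SMALL_N, slope=0.05, intercept=1.5))
```

Keeping the exceptions in a table rather than in the fit means the linear coefficients can be re-derived for another FPGA without touching the small-n cases.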

2.  Measuring  Hardware  Complexity 

It is difficult to measure hardware utilization when there are different types of
resources, each having a different quantity. This section describes how each resource
is consumed, and how a single measure can be used to describe overall hardware
utilization based on the utilization of each resource.

a.  Deciding  on  the  Basic  Units  of  Measurement 

Since there are multiple ways to organize the basic signal flow through a
CLB, it is complicated to find a common method to quantify how much space a circuit
takes up. In some instances, a device might use only 1 LUT, but also use multiple MUXs
in the same slice. Thus, even when only 1 LUT is used, it may still prevent the use of the
rest of the slice by other circuitry. The synthesis reports from Xilinx ISE Project
Navigator include the number of slices used and the number of LUTs used.
However, the number of slices may be more than half the number of LUTs, suggesting
that not all of the slices use both of their LUTs. For this reason, we measure hardware
utilization in terms of slices utilized. Doing so puts everything in common terms that are
verifiable with the software being used.

Likewise, the synthesis reports include the number of MULT18x18s and
BRAMs used in a particular design. No partial resources are used. Even if only 2 bits of
a MULT18x18 are used, it consumes the entire resource. If only 2 bytes of RAM are
implemented in a BRAM, then it consumes the entire block of memory. Thus, the basic
unit for measurement of MULT18x18s is 1 MULT18x18, and the basic unit for BRAMs
is 1 BRAM.

b.  Finding Meaningful Terminology for Measuring Hardware
Utilization

Since three resources are considered, there are three terms for hardware
utilization. The slice utilization percentage (SUP) is defined as the percentage of the
slices that are required in order to implement a specific logic circuit design, based on the
data from the synthesis reports (see Appendix B.1). Likewise, the multiplier utilization
percentage (MUP) and BRAM utilization percentage (BUP) are defined as the
percentages of the respective resources used to implement a specific circuit design. Table 2
summarizes the equations for calculating these measures, using the quantities of
resources given in Table 1.


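$$\mathrm{SUP} = \frac{\text{\# slices utilized}}{\text{total \# slices on FPGA}} \times 100\% = \frac{\text{\# slices utilized}}{33792} \times 100\%$$

$$\mathrm{MUP} = \frac{\text{\# MULT18x18s utilized}}{\text{total \# MULT18x18s on FPGA}} \times 100\% = \frac{\text{\# MULT18x18s utilized}}{144} \times 100\%$$

$$\mathrm{BUP} = \frac{\text{\# BRAMs utilized}}{\text{total \# BRAMs on FPGA}} \times 100\% = \frac{\text{\# BRAMs utilized}}{144} \times 100\%$$

$$\mathrm{HUP} = 100\% - \sqrt[3]{(100\% - \mathrm{SUP})\,(100\% - \mathrm{MUP})\,(100\% - \mathrm{BUP})}$$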


Table  2  Equations  for  SUP,  MUP,  BUP,  and  HUP. 


It is often useful to compare devices that use more than one resource at a
time. For example, large multipliers consume onboard MULT18x18s, but also require
partial product adders, which consume CLBs. Consider comparing an NFG that uses this
multiplier with one that uses a large ROM instead. The ROM might consume only
BRAMs. A SUP, MUP, and BUP can be calculated and compared for each NFG, but
these three separate values give no way to compare overall hardware utilization. For this
reason, the hardware utilization percentage (HUP) is computed as a function of the SUP,
MUP, and BUP. The function shown in Table 2 is used because it exhibits desirable
characteristics. When any single resource is fully consumed (i.e., SUP, MUP, or
BUP ≥ 100%), HUP = 100%, indicating that the required resources are not available on
the Xilinx Virtex-II XC2V6000 FPGA. This does not necessarily mean that the NFG
cannot be implemented on this particular FPGA. It means that the models developed in
this thesis no longer provide accurate estimations for HUP and delay. Each model
assumes that particular components are used. For example, if an NFG requires 169
MULT18x18s, it could be possible to implement it on a single FPGA by building the
additional 25 multipliers from CLBs. However, the models do not take this into account.
Thus, when the HUP for a particular NFG reaches 100%, it shows that the models will
not be able to accurately represent complexity and delay for larger NFG sizes.

When a particular logic device uses all three resources proportionally (i.e.,
SUP=MUP=BUP), the HUP function behaves linearly. When only one resource is
consumed, the HUP function behaves like a cube-root function. The cube-root function
still offers a meaningful relation between hardware utilizations of NFGs that use different
resources. As more hardware is used, the HUP increases. The HUP increases slightly
less than it would if all resources were consumed proportionally. Figure 13 shows an
example where the hardware resources are used proportionally (i.e., MUP=SUP=BUP),
where slices are used without any other resources (BUP=MUP=0), and where slices and
MULT18x18s are used proportionally but without any BRAMs (MUP=SUP, BUP=0).



Figure 13 HUP vs. SUP for Various BUPs and MUPs.

Since the variables SUP, MUP and BUP are weighted evenly within the
HUP equation, the same relationships apply when a single resource is used, regardless of
what resource is used. In general, arithmetic components do not consume all three
resources proportionally. Multipliers consume MULT18x18s and CLBs in uneven
proportions, and coefficient tables consume BRAMs and CLBs in uneven proportions.
The majority of the arithmetic components analyzed in this thesis consume only one type
of resource. Figure 14 shows another example where the BUP=SUP for various MUP.
When the MUP=0, the HUP curve shows two resources being consumed proportionally.
When MUP=50%, note that the HUP begins at approximately 20%. Thus, when 50% of
the MULT18x18s are used, it is considered that at least 20% of the total FPGA resources
are used. When 95% of the MULT18x18s are used, at least 90% of the total hardware is
used.



Figure  14  HUP  versus  SUP  where  BUP=SUP  for  various  MUPs. 
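
The utilization measures can be sketched in Python as follows. This is not the author’s HUP.m: the function names are hypothetical, and the sketch assumes that HUP is 100% minus the cube root of the product of the three unused percentages, which is the form consistent with the linear and cube-root behavior described in this section. Each input is clamped to the range 0 to 100% before the product is formed.

```python
TOTAL_SLICES = 33792  # Table 1
TOTAL_MULTS = 144     # Table 1
TOTAL_BRAMS = 144     # Table 1


def utilization_pct(used, total):
    """Percentage of a resource type that a design consumes."""
    return 100.0 * used / total


def hup(sup, mup, bup):
    """Hardware utilization percentage from SUP, MUP and BUP (all in percent).

    Assumed form: 100 - cube_root((100 - SUP)(100 - MUP)(100 - BUP)),
    with each input clamped to [0, 100].
    """
    sup, mup, bup = (min(max(p, 0.0), 100.0) for p in (sup, mup, bup))
    unused_product = (100.0 - sup) * (100.0 - mup) * (100.0 - bup)
    return 100.0 - unused_product ** (1.0 / 3.0)


# Hypothetical design: 8448 slices, 72 multipliers and no BRAMs.
sup = utilization_pct(8448, TOTAL_SLICES)
mup = utilization_pct(72, TOTAL_MULTS)
bup = utilization_pct(0, TOTAL_BRAMS)
print(f"SUP={sup:.1f}%  MUP={mup:.1f}%  BUP={bup:.1f}%  HUP={hup(sup, mup, bup):.1f}%")
```

Under this assumed form, equal inputs return the same value (the linear case), a lone MUP of 50% gives roughly 20%, and any input at or above 100% drives the result to 100%.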

It should be noted that the HUP equation in Table 2 does not exhibit
desirable properties when SUP, MUP, or BUP is greater than 100%. Thus, the MATLAB
function HUP.m caps each at 100%. This produces a maximum HUP of 100%. When
HUP = 100%, it indicates that the complexity and delay estimates for the NFG being
analyzed are not accurate because there is not enough of at least one of the resources
that it needs.

3.  Measuring Propagation Delay


The  goal  of  this  section  is  to  determine  how  to  accurately  measure  the 
propagation  delay  of  a  given  circuit,  without  having  to  build  that  particular  circuit  and 
simulate  it.  Signal  propagation  delay  depends  on  the  path  over  which  the  signal 
propagates.  Thus,  the  general  architecture  of  the  circuit  must  be  understood  in  order  to 
know  what  delays  are  encountered  by  a  given  signal.  In  this  section,  we  are  concerned 
with  finding  the  longest  propagation  delay  for  each  particular  circuit.  In  cases  where 
architectures are simple, such as the adder (Section E.1), accurate expressions are
straightforward. For other cases, such as the multiplier (Section E.2), data is collected
from  simulation  results  and  estimates  are  made  to  represent  missing  data.  In  both  cases, 
it  is  important  to  understand  the  source  of  the  delays.  Timing  data  was  acquired  using  a 
low-level  synthesis  tool  in  Xilinx  ISE  Project  Navigator.  In  some  cases,  it  was  simple  to 
correlate  the  timing  data  from  the  synthesis  reports  to  the  data  supplied  in  [18].  In  other 
cases,  timing  data  from  the  synthesis  reports  alone  was  used.  The  following  delays  are 
discussed  to  better  understand  their  contribution  to  propagation  delay. 

a.  Net  Delay 

Net delay ($t_{net}$) is common to all circuits designed on FPGAs. Net delay
is a propagation delay that is due to transferring charge along a wire. It is proportional to
the size of the wire or conductor and inversely proportional to the drive strength of the
associated power supply. DC power supplies can only supply a limited amount of
current. On an FPGA, the drive strength for a given node is dependent on what driver,
such as a logic gate, register, or IOB, is connected to the node. The time it takes to charge
a given wire to a desired voltage is also dependent on the fanout of the driver. If the
driver supplies charge to more inputs, then more charge is required, resulting in a longer
time delay for the entire wire to build to the required voltage. Figure 15 shows an
example of a schematic circuit built in Xilinx Project Navigator to collect net delay data
for various fanouts. Appendix B.2 contains the data collected for net delays.

Figure 15 Schematic Example of Various Fanouts.

When designing arithmetic logic devices, the net delay is significant
because some architectures have relatively large fanouts. Net delays on the Xilinx
Virtex-II XC2V6000 FPGA with a speed grade of -4 range from 0.517 ns to 1.316 ns
based on synthesis reports for various circuits. Figure 16 shows the net delay versus
fanout that is generated by the function fillLin when given the collected data as an input.
Although the net delay is generally smaller than the delay of logic components, when
multiple logic stages with high fanouts are cascaded, the associated net delays can be a
significant contribution to the total combinational delay of the circuit.

Figure 16 Net Delay vs. Fanout after fillLin.

When estimating propagation delays for various arithmetic devices (see
Section E in this chapter), the file HUandDelay includes the net delay going into the
particular device. However, it excludes the net delay associated with the output because it
depends on the number of inputs driven by the output. This simplifies the calculation of
propagation delays for composite circuits. Figure 17 illustrates the propagation delays
associated with combining two arithmetic devices in series. The total propagation time
through the composite circuit is
$t_{prop} = t_{net,1} + t_{comb,1} + t_{net,4} + t_{comb,2}$, where $t_{net,k}$ is the net
delay associated with a fanout of $k$, and $t_{comb,j}$ is the combinational delay of the
$j$-th arithmetic device in series.


[Figure 17 image: Device 1 (input fanout = 1) feeding Device 2 (input fanout = 4); the delays along the path are $t_{net,1}$, $t_{comb,1}$, $t_{net,4}$, and $t_{comb,2}$.]


Figure  17  Propagation  Delay  for  Arithmetic  Devices  in  Series. 
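
The series composition can be sketched in Python as a sum of each device’s input net delay and combinational delay. The function name and every number below are hypothetical placeholders; in practice the net delay table would come from the fillLin results for net delay versus fanout.

```python
def series_delay(stages, net_delay):
    """Total propagation delay (ns) of devices connected in series.

    stages    -- list of (input_fanout, t_comb) pairs in signal order
    net_delay -- mapping from fanout to net delay in ns
    """
    total = 0.0
    for fanout, t_comb in stages:
        # Each device contributes the net delay on its input plus its own
        # combinational delay; the output net is charged to the next stage.
        total += net_delay[fanout] + t_comb
    return total


# Placeholder values only; these are NOT measured delays.
PLACEHOLDER_NET_DELAY = {1: 0.5, 4: 0.7}
stages = [(1, 2.0), (4, 3.0)]  # Device 1 (fanout 1), then Device 2 (fanout 4)
print(f"t_prop = {series_delay(stages, PLACEHOLDER_NET_DELAY):.3f} ns")
```

Attributing the input net delay to each device, and leaving the output net to whatever follows, lets stages be chained without knowing in advance what each output drives.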

When arithmetic devices are placed in parallel, the fanout of the input
wires becomes the sum of the fanouts of each device, and the net delay for each device
requires adjustment. Without this adjustment, small errors (up to 0.8 ns) are introduced in
propagation delay estimations every time devices are placed in parallel. In most NFGs, this
error is insignificant, but this thesis uses the corrected net delays. Figure 18 illustrates
how this error affects the propagation delay of the composite circuit.




Figure  18  Propagation  Delay  for  Arithmetic  Devices  in  Parallel. 
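
The series and parallel bookkeeping described above can be sketched in a few lines of
Python. This is only an illustration: the thesis's own implementation is the MATLAB file
HUandDelay, and the names net_delay, series_delay, and parallel_net_delay, together with
the linear net-delay coefficients, are hypothetical placeholders rather than the data
behind Figure 16.

```python
def net_delay(fanout, base_ns=0.4, per_load_ns=0.05):
    """Placeholder net-delay model (ns) for a given fanout.

    The coefficients are invented for illustration; a real estimate would
    interpolate measured net-delay-versus-fanout data instead.
    """
    return base_ns + per_load_ns * fanout


def series_delay(stages):
    """Total delay (ns) of devices connected in series.

    stages: list of (input_fanout, comb_delay_ns) pairs in signal order.
    Each device contributes the net delay on its input plus its own
    combinational delay; the net delay on the final output is excluded.
    """
    return sum(net_delay(fanout) + t_comb for fanout, t_comb in stages)


def parallel_net_delay(fanouts):
    """Net delay (ns) on an input wire shared by devices in parallel.

    The shared wire drives every device, so its fanout is the sum of the
    individual input fanouts.
    """
    return net_delay(sum(fanouts))


# Two devices in series with input fanouts 1 and 4 (combinational delays are made up).
t_series = series_delay([(1, 1.0), (4, 2.0)])
# Two devices sharing one input wire: the fanout becomes 1 + 4 = 5.
t_shared = parallel_net_delay([1, 4])
```

Keeping the output net delay out of each device's figure is what lets the series case be a
plain sum; the parallel case only needs the input net delay recomputed for the combined
fanout.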
b. LUT Delays


LUT delays are the propagation delays associated with a signal
propagating from the input of a LUT (or function generator) to the output of the LUT.
LUT delays are denoted as $t_{LUTq}$, where $q$ is an integer and $1 \le q \le 6$. [18]
reports $t_{LUT4}$ to be 0.44 ns, and synthesis reports demonstrate this value to be
0.439 ns. The delay is the same for LUTs even if not all four inputs are used. Thus,
$t_{LUT1} = t_{LUT2} = t_{LUT3} = t_{LUT4}$. Five-input LUTs can be formed using two
4-input LUTs and a specialized MUX within the same slice. According to [18],
$t_{LUT5} = 0.72$ ns. The additional delay is due to the MUX that is needed to combine two
4-input LUTs to form a 5-input LUT.

c.  Delays  in  Special  Purpose  MUXs 

As  discussed  previously,  there  are  various  MUXs  in  each  slice  that  can  be 
configured  for  use  in  design  of  a  logic  circuit.  This  section  identifies  some  of  the 
propagation  delays  associated  with  the  MUXs  that  are  used  in  the  arithmetic  devices  in 
this  thesis. 

MUXCY, shown in Figure 8, provides a path for fast carry logic used to
implement an adder. The two delays of concern are $t_{MUXCY,S\to O}$ and
$t_{MUXCY,I0\to O}$. The first delay, $t_{MUXCY,S\to O}$, is the time it takes to change
the output O after the select line S changes. The second, $t_{MUXCY,I0\to O}$, is the
propagation delay of a signal from input I0 to the output O. Empirical evidence from
Xilinx ISE Project Navigator confirms data from [18] for the values in Table 3.


Parameter                  Time delay (ns)
$t_{MUXCY,I0\to O}$        0.053
$t_{MUXCY,S\to O}$         0.298

Table 3 MUXCY Propagation Delays.

MUXFX is designed to combine signals from multiple slices into a single
output. This is useful when constructing functions of more than 4 variables. For
example, instead of cascading multiple layers of 2:1 MUXs built from LUTs, larger
MUXs are constructed from the built-in MUXFXs. This eliminates the net delays
associated with interconnecting LUTs. For example, a 4-input function takes 0.44 ns plus
a net delay to produce a result, while a 5-input function takes only 0.72 ns plus one net
delay (versus 2 x 0.44 ns = 0.88 ns plus two net delays for two cascaded LUTs).
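
The comparison above can be written out as a small Python sketch. The two LUT delays are
the values quoted in the text; the net delay is left as a parameter because its value
depends on fanout, and the function names are hypothetical.

```python
T_LUT4_NS = 0.44  # delay of one 4-input LUT, as quoted in the text
T_LUT5_NS = 0.72  # delay of a 5-input function built inside one slice, as quoted


def five_input_delay_in_slice(t_net_ns):
    """5-input function using two LUTs and the slice MUX: one net delay."""
    return T_LUT5_NS + t_net_ns


def five_input_delay_cascaded(t_net_ns):
    """5-input function using two cascaded LUTs: two LUT delays, two net delays."""
    return 2 * T_LUT4_NS + 2 * t_net_ns


def savings_ns(t_net_ns):
    """Time saved by the in-slice implementation for a given net delay."""
    return five_input_delay_cascaded(t_net_ns) - five_input_delay_in_slice(t_net_ns)
```

The saving is 0.16 ns from the logic alone plus one full net delay, so it grows with the
fanout of the interconnect that the cascade would have needed.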

d.  IOB  Delay 

Timing  data  was  acquired  using  a  low-level  synthesis  tool  in  Xilinx  ISE 
Project  Navigator.  The  synthesis  includes  estimated  routing  delays  (net  delays), 
combinational  delays,  and  Input/Output  Buffer  (IOB)  delay.  Since  NFGs  would  most 
likely  cascade  multiple  arithmetic  and/or  memory  units  together,  IOB  delay  data  is 
removed  from  the  total  delay  for  the  particular  component.  For  example,  the  total  delay 
of  an  NFG  that  is  comprised  of  a  RAM  unit  propagating  into  a  multiplier,  then  into  an 
adder,  is  the  sum  of  the  combinational  delays  of  each  component  and  the  estimated 
routing  delays.  The  low  level  synthesis  provides  timing  data  along  the  longest 
combinational  path,  and  includes  the  IOB  delays,  net  delays,  and  combinational  delays 
based on the routing through each slice. The data collected in LoadISEDeviceData
removes the IOB delays and retains the net delays.

D.  ESTIMATING  PARAMETERS  FOR  VARIOUS  BASIC  ARITHMETIC 
LOGIC  COMPONENTS 

Various NFG configurations require various arithmetic logic devices in series
and/or in parallel. This section discusses measuring the complexities and propagation
delays for common arithmetic logic devices applicable to NFGs. It describes simple
architectural designs for several circuits, which are not necessarily the most efficient or
compact designs. The goal is not to find the best case hardware design, but to use
commonly accepted methods to build basic arithmetic circuits in order to compare
complexities and propagation delays. The measurements of the arithmetic circuits in this
section are used to measure the overall complexity and delays for the NFG configurations
that are built from them.

The  author’s  MATLAB  function  HUandDelay.m  calculates  the  SUPs,  MUPs, 
BUPs,  and  delays  of  several  components.  These  parameters  are  calculated  based  on  the 
particular  component  having  n  input  bits  and  w  output  bits.  The  number  of  output  bits  is 
only  used  for  memory  components  and  SIEs.  Table  4  summarizes  each  function  handled 
by  HUandDelay. 


Input variables   Device name       Output variables: SUP, MUP, BUP, and propagation delay for a(n):
n, w              ‘ROM’, ‘LUT’      n-input w-output function, or a single bit ROM with n address lines ($2^n \times w$ ROM)
n, w              ‘Adder’           adder with 2 input vectors of length n and a carry in bit, and a single output vector of length n, plus a carry out bit. Note: w is not used.
n, w              ‘Mult’            multiplier with 2 input vectors of length n, and a product vector of length 2n (built from CLBs only, no MULT18x18s are used)
n, w              ‘Mult18x18’       multiplier with 2 input vectors of length n, and a product vector of length 2n (built from CLBs and MULT18x18s)
n, w              ‘MUX’             n:1 MUX, with $n + \lceil \log_2 n \rceil$ input bits, and 1 output bit
n, w              ‘BarrelShifter’   n-bit barrel shifter with $n + \lceil \log_2 n \rceil$ input bits and n output bits
n, w              ‘BRAM’            memory unit constructed from onboard BRAM units, with n address bits in, and w bits out ($2^n \times w$ RAM)
n, w              ‘SIE’             segment index encoder with n input bits and w output bits
n, w              ‘SOP’             worst case Sum of Products logic circuit with n inputs and w output bits
n, w              ‘Mem’             best case memory unit constructed from BRAMs or from ROMs, $2^n \times w$ ROM

Table 4 Summary of “HUandDelay” Operations.

The component designs do not necessarily represent the best case design or the
worst case design. They are merely working designs that have been constructed from
either behavioral VHDL models or from schematic models that can be implemented
efficiently onboard the FPGA. Bit widths up to 129 bits are analyzed.

1.  Adders  and  Subtractors 

Adders  and  subtractors  are  commonly  used  arithmetic  logic  devices.  Since  a 
subtractor  can  be  constructed  with  almost  equivalent  complexity  to  an  adder,  only  the 
adder  architecture  is  analyzed.  For  NFGs  that  require  subtractors,  adders  are  substituted 
because  they  exhibit  the  same  characteristics. 

a.  Architecture 

Xilinx FPGA architecture has been specifically designed for fast
mathematical operations, including additions and multiplications. Fast carry chains are
built in columns that run through each slice via fast MUXs, namely the MUXCY (see
Figure 8). The propagation delay from one bit to the next is approximately 53 ps, so even
large RCAs can compute a large number of bits relatively quickly. Each fast carry chain
can be 176 bits long [18]. This means that the carry propagation portion of the adder’s
delay is only 9.3 ns for a 176-bit adder. Longer carry chains can be constructed by
connecting the last carry out to another fast carry chain, but this incurs the associated
net delays. However, an adder wider than 176 bits is not generally required in NFGs.
Contrary to conventional logic design, using Carry Look-Ahead (CLAH) architecture
actually produces slower adders due to the additional XOR logic depth. Figure 19 shows
how a single-bit full adder is implemented using a LUT, MUXCY, and XOR within half
of a slice. Note that each LUT is configured as a two-input XOR gate, having the same
delay as a 2-input LUT, $t_{LUT2}$.


Figure  19  Single-bit  Full  Adder  Implemented  on  Virtex-II  FPGA. 


b.  Complexity  Analysis 


The goal of this complexity analysis is to find an accurate method to
quantify hardware utilization for adders based on the size of the adder. Figure 20 shows
the logic and carry path of an $n$-bit full-adder chain implemented in $\lceil n/2 \rceil$
slices. Thus, an $n$-bit adder occupies $\lceil n/2 \rceil$ slices. Empirical data in the
synthesis reports also confirms this relationship. The number of slices is calculated using
the ceiling function in the author’s function HUandDelay (Appendix A.2) and is used to find
the SUP (Table 2). Because adders do not use MULT18x18s or BRAMs, the function returns
MUP=0 and BUP=0 for an $n$-bit adder.
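
A minimal Python sketch of this slice count follows. It is not the author's MATLAB code:
the function names are hypothetical, and the total of 33,792 slices for the XC2V6000 is an
assumption taken from outside this section (Table 2, which defines the SUP, is not
reproduced here).

```python
TOTAL_SLICES = 33_792  # assumed slice count of the XC2V6000; not stated in this section


def adder_slices(n):
    """Slices occupied by an n-bit adder: two bits per slice, rounded up."""
    if n < 1:
        raise ValueError("adder width must be at least 1 bit")
    return (n + 1) // 2  # integer form of ceil(n / 2)


def adder_utilization(n):
    """Utilization percentages for an n-bit adder.

    Adders use no MULT18x18s and no BRAMs, so MUP and BUP are zero.
    """
    sup = 100.0 * adder_slices(n) / TOTAL_SLICES
    return {"SUP": sup, "MUP": 0.0, "BUP": 0.0}
```

For example, adder_slices(33) evaluates to 17, because the odd bit still occupies half of
an extra slice.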


c.  Delay  Analysis 

The propagation delay of an RCA is linear in the number of bits. Behavioral models for
adders implement RCAs on the Virtex-II, so the propagation delay is expected to be linear.
Data collected from the synthesis reports confirm this for $n > 4$. The data used for
propagation delays does not include IOB delays, but does include net delays. By tracing the
propagation path given by the synthesis reports, as shown in Figure 20, the total delay is
derived to be $t_{prop} = t_{MUXCY,I0\to O}(n - 2) + t_{overhead}$, where
$t_{MUXCY,I0\to O} = 0.053$ ns and $t_{overhead} = 2.528$ ns. According to [18], the carry
delay through the fast MUXCY from input I0 to output O is 0.05 ns, correlating the
theoretical expectation and empirical data to the manufacturer’s specifications. Also,
there is no carry propagation in a single-bit or a 2-bit adder since they are within the
same slice. For larger RCAs, the first and last MUXCY do not lie in the longest propagation
path. Thus, the delay along the carry propagation path is proportional to $n - 2$, and the
overhead delay accounts for the rest of the time delay through the RCA. Figure 20 shows the
total propagation delay path through an RCA implemented on the Xilinx Virtex-II XC2V6000
FPGA.


Figure  20  An  n-bit  RCA  Propagation  Delay  Path  on  Xilinx  Virtex-II.  (After  [18]). 

The remaining portion of the equation can be verified by breaking down
$t_{overhead}$ into the delays of the other logic components within the slices containing
the LSB and the MSB. Switching characteristics for these components (in [18]) correspond
to the signal path delays found in the synthesis reports. For example,
$t_{overhead} = t_{XORCY} + t_{net} + t_{LUT2,I0\to O} + t_{MUXCY,S\to O}$, where
$t_{XORCY} = 1.274$ ns, $t_{net} = 0.511$ ns, $t_{LUT2,I0\to O} = 0.439$ ns, and
$t_{MUXCY,S\to O} = 0.298$ ns. These terms are explained in [19] and illustrated in
Figure 20. The synthesis reports in Appendix B.1 show the delay of each component in the
overhead common to all $n$-bit RCAs. Since a linear equation can accurately (to within
0.01 ns) represent the simulation data, the author’s MATLAB function HUandDelay returns the
propagation delay of an $n$-bit RCA by calculating it with the same linear equation instead
of using a table of referenced values. This allows a simple calculation to accurately
return a valid propagation delay for a given RCA size $n$. Figure 21 shows the overall SUP
and propagation delay for adders versus the size of the adder.




Figure  21  SUP  and  Propagation  Delay  for  n-bit  RCAs. 
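
The linear delay model for RCAs can be sketched as follows. The two constants are the
values quoted in the delay analysis; the function name and the choice to reject widths of
four bits or fewer (where the text does not confirm linearity) are assumptions of this
sketch, not a description of HUandDelay.

```python
T_CARRY_NS = 0.053     # per-bit carry delay through MUXCY (I0 to O), from the text
T_OVERHEAD_NS = 2.528  # fixed overhead outside the carry chain, from the text


def rca_delay_ns(n):
    """Estimated propagation delay (ns) of an n-bit ripple-carry adder.

    The first and last MUXCY are not on the longest path, so only n - 2
    carry stages are counted.
    """
    if n <= 4:
        raise ValueError("the linear model is only confirmed for n > 4")
    return T_CARRY_NS * (n - 2) + T_OVERHEAD_NS
```

Under this model a 32-bit adder is estimated at 0.053 x 30 + 2.528 = 4.118 ns.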

2.  Multipliers 

In  order  to  understand  hardware  utilization  and  propagation  delays  for  multipliers, 
it is necessary to understand their architecture.

a.  Architecture 

Array multipliers generally require partial product generators (PPGs) and
PP adders. Figure 22 shows the general architecture of an n-bit multiplier using PPGs
and RCAs. The hardware utilization percentage (HUP) and propagation delay of an
$n \times n$-bit array multiplier depend on the number of PP multipliers required and the
number of PPs that need to be added together. Relatively large multipliers may need to be
analyzed for some of the applications in this thesis. Xilinx’s Virtex-II XC2V6000 FPGA
includes 144 18x18-bit signed multipliers. Each one can be used as an $n \times n$-bit
multiplier for $n < 18$, or as a PPG for larger multipliers.

Figure 22 General nxn Array Multiplier Architecture.


The size of an array multiplier depends on the number of bits being
multiplied. It also varies depending on the size of the PPGs. The most basic PPG is the
1x1-bit multiplier, which is an AND gate. A 2x2-bit multiplier is a 4-input, 4-output
function, which can be realized in four LUTs. Since the number of function inputs grows
proportionally to $n$, the multiplier becomes very complex for larger PPGs if LUTs are used
to realize the function. An $n \times n$-bit multiplier designed with the architecture in
Figure 22 requires […] + 1 $r$-bit adders. Figure 23 illustrates the proportionality of
multipliers’ SUPs to […]. As $r$ gets smaller, more adders are required to sum the partial
products. Figure 23 shows the HUP and propagation delays for a multiplier with $r = 4$
built from 4-bit PPGs and 4-bit RCAs. It compares them to multipliers built using the
MULT18x18s. Using MULT18x18s reduces the SUP for a multiplier, but also increases the MUP
(see Table 2).

Figure 23 Multiplier HUP and Delay vs. Multiplicand Size for Multipliers Built with
MULT18x18s vs. CLBs.

Figure 23 shows that it is more efficient to develop large multipliers using
the MULT18x18s on the Virtex-II FPGA. Here, the lower line represents multipliers
built from MULT18x18s only, and the upper line represents multipliers built from LUTs
only. Each MULT18x18 can be used as an $n \times n$-bit multiplier for $n < 18$, or as an
$r$-bit PPG for larger multipliers, where $r \le 17$. Doing so takes advantage of the
benefits discussed in Section A.1.b. The propagation through each PPG is a linear function
of […]. For multipliers with $n > 17$, all of the PPs can be calculated in parallel. This
makes it more time-efficient to split $n$-bit multiplicands into equal-width multiplicands
for each PPG, rather than using the maximum number of bits in a single MULT18x18 with
fewer bits in the other required MULT18x18s. For example, if a 24x24-bit multiplier is
required (Figure 24), it takes less time to compute four 12x12-bit multiplications in
parallel using the MULT18x18s than it takes to compute one 17x17-bit multiplication in
parallel with two 7x17-bit multiplications and a 7x7-bit multiplication. This is because
the 17x17-bit multiplication takes longer than any of the other multiplications, since the
MSB of its product would come off of pin 34.

Figure 24 24-bit Multipliers with Uneven and Even PPs.
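
The even-split rule can be illustrated with a short Python sketch. It assumes that each
operand is cut into the fewest pieces of at most 17 bits, that the pieces are made as equal
as possible, and that every pair of pieces needs its own MULT18x18; these assumptions
reproduce the 24x24-bit example above but are not taken from the author's code, and the
function names are hypothetical.

```python
MAX_PPG_BITS = 17      # widest unsigned operand a signed 18x18 multiplier accepts
TOTAL_MULT18X18 = 144  # MULT18x18 blocks on the XC2V6000, from the text


def even_split(n, max_bits=MAX_PPG_BITS):
    """Return (pieces per operand, bits per piece, multiplier blocks needed)."""
    pieces = -(-n // max_bits)  # ceil(n / max_bits)
    width = -(-n // pieces)     # ceil(n / pieces): widest piece after an even split
    return pieces, width, pieces * pieces


def mup_percent(n):
    """Percentage of the MULT18x18 blocks used by an n x n-bit multiplier."""
    _, _, blocks = even_split(n)
    return 100.0 * blocks / TOTAL_MULT18X18
```

With these assumptions, even_split(24) gives two 12-bit pieces per operand and four
multiplier blocks, and mup_percent(32) is about 2.8%.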

Since modern FPGAs incorporate multipliers, this analysis is usable for
many other hardware applications as well. Array multipliers may be better designed
using combinational logic. However, large multipliers require a larger portion of the
CLBs on the FPGA and a much longer propagation delay. It is more efficient to use a
few of the onboard MULT18x18s so that the CLB resources are available for other
required logic devices. A 32x32-bit multiplier built from combinational logic consumes
24.9% of the slices on the FPGA and takes 29.9 ns to produce a result. The same
multiplier built using MULT18x18s consumes only 0.14% of the slices and 2.8% of the
MULT18x18s, and has a propagation delay of 17.7 ns. Since the objective is to establish
a basic way to compare NFGs, and not to develop the most efficient $n \times n$-bit
multiplier, using the MULT18x18s onboard the FPGA as PPGs is a sufficient and reasonable
method to build large multipliers, and results in a shorter propagation delay.

b.  Complexity  Analysis 

Determining the size of a multiplier is much more complicated than determining the
size of an adder. For multipliers with $n < 18$, a single MULT18x18 can be used, thus the
percentage of MULT18x18s used, or MUP, is $\frac{1}{144} \approx 0.7\%$. When more than one
MULT18x18 is required, $r$-bit adders are required to sum the PPs. These $r$-bit wide PP
adders consume CLBs. Therefore two parameters must be measured for any circuit
design that incorporates the on-chip MULT18x18s: MUPs and SUPs. If either the MUP
or the SUP exceeds 100%, then the circuit being implemented will not fit on the FPGA.
The HUP is shown in Figure 23.

Because array multipliers can be very complex, and can be constructed in
various ways, it is neither feasible nor necessary to dive deep into the architecture to
analyze complexity in terms of CLBs. The adder was described in such a way that the
architecture and product specification validated the simulation results from the synthesis
reports. Since simulation data was proven accurate for adders, it is assumed accurate for
multipliers. Thus behavioral models of unsigned multipliers were designed and
synthesized using ISE Project Navigator for various word widths. The synthesis reports
provide the number of slices and MULT18x18s required, validating the quantity
estimated in the architectural analysis above. These values are included in
Appendix B.2, and are imported by HUandDelay to estimate hardware utilization using
the linear approximation function fillLin.

c.  Delay  Analysis 

For small multipliers, where $n < 18$, the propagation delay is that of a single
MULT18x18. Larger multipliers require multiple adders or adder trees. Again, the
design of the multiplier can vary widely, which affects the delay. To obtain relevant
data by a simple method, timing data is collected from the synthesis reports for
the behavioral models. The propagation delays for various multiplier sizes are provided
in Appendix B.2 and displayed in the graph in Figure 23.

3.  Multiplexers  (MUXs) 

The NFG models in this thesis do not use MUXs. However, they are analyzed
here so that future models can incorporate them. MUXs often perform vital functions
(such as data signal routing) in arithmetic logic devices. For example, in floating-point
systems [23], MUXs can be used to select an output from either a computed value or
a special number value (exact 0, NaN, ±∞) based on whether or not the input is
a special number value. An $n$:1 MUX has $n$ input bits, and routes only one of these
inputs to the output bit depending on the bits used for selection. The number of selection
bits required is $\lceil \log_2 n \rceil$. For example, a 16:1 MUX has 16 input bits
(I0-I15) and 4 selection bits (S0-S3). To route input bit I7 to the output, the selection
bits must be $0111_2$, or $7_{10}$.
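
The selection-bit arithmetic can be checked with a short Python sketch (the function names
are hypothetical). The integer bit_length method is used instead of a floating-point
logarithm so that exact powers of two are handled without rounding error.

```python
def select_bits(n):
    """Number of selection bits for an n:1 MUX, i.e. ceil(log2(n))."""
    if n < 2:
        raise ValueError("a MUX needs at least two inputs")
    return (n - 1).bit_length()


def select_code(index, n):
    """Binary selection word (most significant bit first) that routes input `index`."""
    if not 0 <= index < n:
        raise ValueError("input index out of range")
    return format(index, "0{}b".format(select_bits(n)))
```

For the 16:1 example, select_bits(16) is 4 and select_code(7, 16) is "0111".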


a.  Architecture 

The  Virtex-II  architecture  supports  fast  multiplexing  by  joining  the  LUTs 
within  each  CLB  with  MUXs  built  into  each  slice,  thus  minimizing  propagation  delays 
due  to  connecting  to  logic  blocks  in  other  CLBs.  The  delay  is  a  nonlinear  function  with 
respect  to  size.  By  configuring  each  LUT  to  realize  a  2:1  MUX,  1  slice  can  realize  a  4:1 
MUX  by  using  the  specialized  MUXF5.  Adjacent  slices  can  be  combined  to  form  larger 
MUXs  using  the  specialized  MUXFX  within  each  slice.  Figure  25  illustrates  the 
architecture of a 16:1 MUX built within a single CLB, or 4 slices. MUXs with $n > 16$ can
be built by combining multiple 16:1 MUXs with other MUXs.


Figure 25 16:1 MUX within a Single CLB. (From [18])

b. Complexity Analysis


Since four slices can implement a 16:1 MUX, the number of slices
required in an $n$:1 MUX is $\lceil n/4 \rceil$. To validate this approximation of hardware
utilization based on $n$, schematic models of MUXs were constructed in Xilinx ISE Project
Navigator. The schematics implement primitive MUXs included in Xilinx’s library. The
largest primitive MUX is a 16:1 MUX, which corresponds to the architecture described
above. Larger MUXs were built by combining the primitive MUXs. For example, a 32:1
MUX was constructed by coupling two 16:1 primitive MUXs with a 2:1 primitive MUX.
This method assures that an $n$:1 MUX is constructed in a compact manner. Synthesis
reports for the schematic designs provided the data in Appendix B.2. The slice utilization
data confirmed the estimates from the architectural description. The SUP for an $n$:1
MUX is calculated using the equation in Table 2. Since no MULT18x18s or BRAMs are
used, MUP=0 and BUP=0 for all MUXs analyzed in this thesis.
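
As a rough illustration of this construction, the Python sketch below estimates the slice
count from the four-slices-per-16:1 ratio and splits a large MUX into 16:1 primitives plus
one combining MUX. The rounding up to whole slices and the function names are assumptions
of this sketch; the slices of the combining MUX are not counted separately.

```python
def mux_slices(n):
    """Approximate slices for an n:1 MUX: four inputs per slice, rounded up."""
    if n < 2:
        raise ValueError("a MUX needs at least two inputs")
    return -(-n // 4)  # ceil(n / 4)


def decompose_mux(n, primitive_inputs=16):
    """Split an n:1 MUX into primitive MUXs feeding one combining MUX.

    Returns (number of primitive MUXs, number of inputs of the combining MUX).
    A combining size of 1 means no second level is needed.
    """
    primitives = -(-n // primitive_inputs)
    return primitives, primitives
```

For the 32:1 example in the text, decompose_mux(32) gives two 16:1 primitives combined by a
2:1 MUX.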




Figure  26  SUP  vs.  MUX  size  (bits). 


c.  Delay  Analysis 


The propagation delay through a large MUX depends on the number of
MUX levels, and the delay through each particular MUX. Since different MUXs are
used within each CLB, they each have a different propagation delay. The number of
MUX levels is $\lceil \log_2 n \rceil$. The synthesis reports provide propagation delay
data for various MUX sizes. The data confirms the logarithmic relation between $n$ and
propagation delay.


Figure  27  Propagation  Delays  vs.  MUX  Size  (bits). 

4.  Barrel  Shifters 

Like MUXs, barrel shifters are not used in any of the models in this thesis.
However, they are analyzed here because they may prove useful in reducing hardware
complexity and delay for linear NFGs that restrict their slope coefficients ($c_{1i}$) to
powers of 2. Barrel shifters can be used to realize multipliers when one of the
multiplicands is a power of 2. They can be significantly faster and require fewer slices
than a general multiplier. A basic $n$-bit barrel shifter consists of $n$ $n$:1 MUXs in
parallel. It shifts bits from the MSB into the LSB, or vice versa. A small amount of
additional logic is needed to convert the basic barrel shifter into an arithmetic or
logical combinational shifter.

a.  Architecture 

Figure 28 shows the general architecture of an $n$-bit barrel shifter,
including the fanouts along the propagation paths. The darkened MUXs indicate that
they can be considered a part of an $n$:1 MUX, multiplexing all inputs to a single output
bit. The easiest method to build a barrel shifter would be to use $n$ $n$:1 MUXs in
parallel, one for each output bit. This is a naive method since it does not share the 2:1
MUXs that can be reused. A better architecture is shown in Figure 28, containing
$\log_2 n$ columns of $n$ 2:1 MUXs.

b. Complexity Analysis


An $n$-bit barrel shifter constructed in the naive manner would consume $n$
$n$:1 MUXs, or $n \lceil n/4 \rceil$ slices. The more hardware-efficient method results in
$\frac{n}{2} \lceil \log_2 n \rceil$ slices, since each 2:1 MUX in Figure 28 can be
constructed from a single LUT. The function HUandDelay uses the latter method.
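
The two constructions can be compared with a small Python sketch. It assumes four MUX
inputs per slice for the naive design and two LUTs per slice for the shared design (one
LUT per 2:1 MUX); the function names are hypothetical and the counts are estimates in the
spirit of the analysis above, not synthesis results.

```python
def naive_barrel_shifter_slices(n):
    """n separate n:1 MUXs, each estimated at ceil(n / 4) slices."""
    return n * -(-n // 4)


def shared_barrel_shifter_luts(n):
    """ceil(log2(n)) columns of n 2:1 MUXs, one LUT per 2:1 MUX."""
    columns = (n - 1).bit_length()  # ceil(log2(n)) for n >= 2
    return n * columns


def shared_barrel_shifter_slices(n):
    """Two LUTs fit in one slice, so halve the LUT count and round up."""
    return -(-shared_barrel_shifter_luts(n) // 2)
```

For a 32-bit shifter these estimates give 256 slices for the naive design against 80 for
the shared design, and the gap widens as n grows.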


c.  Delay  Analysis 

The delay of an $n$-bit barrel shifter is closely related to the delay of an $n$:1
MUX. Because the shift-by-1 MUX select line must be distributed to all $n$ 2:1 MUXs in
the first column, the fanout of this line is $n$. Since the longest propagation path
contains this select line, a net delay based on a fanout of $n$, instead of 1, must be
accounted for. Therefore, the barrel shifter’s propagation delay is the same as that of an
$n$:1 MUX plus the difference in net delays, or
$t_{prop,BarrelShifter} = t_{prop,MUX} + t_{NET}(n) - t_{NET}(1)$. This equation is used
in the function HUandDelay to return the propagation delay for an $n$-bit barrel shifter.
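
The net-delay correction can be sketched in Python as below. The MUX delay and net delay
are passed in as functions because their measured values are not listed in this section;
the two lambda models in the example are invented placeholders, not thesis data.

```python
def barrel_shifter_delay_ns(n, mux_delay_ns, net_delay_ns):
    """Delay of an n-bit barrel shifter from the delay of an n:1 MUX.

    mux_delay_ns(n): propagation delay of an n:1 MUX, which already
                     includes a net delay for a fanout of 1.
    net_delay_ns(f): net delay for a fanout of f.
    The select line drives n MUXs, so the fanout-1 net delay is swapped
    for the fanout-n net delay.
    """
    return mux_delay_ns(n) + net_delay_ns(n) - net_delay_ns(1)


# Placeholder models, for illustration only.
example_mux = lambda n: 1.0 + 0.5 * (n - 1).bit_length()
example_net = lambda fanout: 0.4 + 0.05 * fanout
example_delay = barrel_shifter_delay_ns(16, example_mux, example_net)
```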
5. General Logic Functions


This  section  discusses  various  methods  to  implement  general  functions  based  on  n 
inputs and a single output. These types of functions may be used in NFGs as segment
index  encoders  and  relatively  small  coefficient  tables. 

a.  Generic  n-Input  Functions 

In the worst case, any $n$-input function can be realized with an $n$-input
lookup table (LUT), which is functionally a ROM. The number of required memory cells
is $2^n$ per bit. Most functions can be reduced to smaller logic functions, so $2^n$
represents the upper bound of the required memory cells. In Xilinx’s Virtex-II FPGA, each
LUT has 4 input bits, and thus can represent a 4-input, 1-output function or a 16x1 ROM.
Thus, the number of LUTs required to realize any $n$-bit function or a $2^n \times 1$ ROM
is $2^{n-4}$. Single-port RAM requires the same number of LUTs, but can be both read and
written.
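
The LUT count can be written as a one-line calculation; the Python sketch below adds only
the obvious floor of one LUT for functions of four or fewer inputs and a multiplier for
the number of output bits. The function name is hypothetical, and the count covers the
storage LUTs only, not any MUX logic used to combine them.

```python
def luts_for_function(n, w=1):
    """Upper bound on 4-input LUTs needed to store an n-input, w-output function.

    Each LUT stores 16 bits, so 2**n bits per output need 2**(n - 4) LUTs;
    a function of four or fewer inputs still occupies one LUT per output bit.
    """
    per_output_bit = 1 if n <= 4 else 2 ** (n - 4)
    return w * per_output_bit
```

For instance, a 7-input, 1-output function needs at most 8 LUTs, and each additional input
bit doubles the count.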

The delay of a 4-variable function realized by one LUT is 0.44 ns [18].
The FPGAs are organized such that a 5-input function can be realized within one slice,
without having to cascade the delays, thus yielding a 0.72 ns delay for a 5-input function.
The overall delay through an $n$-bit ROM from an input to an output depends on whether
the complete function is designed using cascades of 4-bit functions or 5-bit functions.
Larger functions require combining 4- or 5-input functions with a MUX large enough to
accommodate a total of $n$ input bits. For the purpose of NFG comparisons, a ROM
performs the same function and utilizes the same hardware as an $n$-input function. Thus,
ROM primitives were constructed schematically in ISE Project Navigator. The synthesis
reports provided timing and hardware utilization data for ROMs with up to 7-bit
addresses. Larger ROMs are constructed from the largest ROM primitive. Thus, an
$n$-address-bit ROM requires $2^{n-7}$ 7-bit address ROMs and a $2^{n-7}$:1 MUX. An example
architecture is shown in Figure 29. HUandDelay imports timing data for the primitive
ROMs from the data set in Appendix B.2, and recursively calls itself to find the
additional hardware and propagation delay of the required MUX. The propagation delay
takes into account the delays from the ROM, the MUX, and the net delay associated with
connecting the two devices together. Figure 30 shows the hardware utilization and
propagation delay of a $2^n \times 1$ ROM. Note that for $n > 14$, it is more efficient to
use BRAM for implementing a large LUT instead of consuming a large number of slices.
HUandDelay automatically selects BRAM implementation for LUTs with $n$ larger than


Figure 29 An n-input Function Using 7-bit Address ROMs and a $2^{n-7}$:1 MUX.
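
The decomposition shown in Figure 29 can be sketched in Python as follows. The dictionary
layout and the function name are hypothetical; only the split into 7-bit-address
primitives and a combining MUX comes from the text.

```python
PRIMITIVE_ADDRESS_BITS = 7  # largest ROM primitive described in the text


def rom_decomposition(n):
    """Describe how a ROM with n address bits is built from primitives.

    The low 7 address bits go to every primitive ROM; the remaining
    n - 7 bits select one primitive's output through the MUX.
    """
    if n <= PRIMITIVE_ADDRESS_BITS:
        return {"primitives": 1, "primitive_address_bits": n,
                "mux_inputs": 1, "mux_select_bits": 0}
    count = 2 ** (n - PRIMITIVE_ADDRESS_BITS)
    return {"primitives": count,
            "primitive_address_bits": PRIMITIVE_ADDRESS_BITS,
            "mux_inputs": count,
            "mux_select_bits": n - PRIMITIVE_ADDRESS_BITS}
```

A 10-bit-address ROM, for example, decomposes into eight 7-bit-address primitives and an
8:1 MUX with three select bits.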




Figure  30  LUT  SUP  and  Delay  vs.  Number  of  Address  Bits  for  a  ROM. 


b. Sum of Products (SOP) Functions


A sum-of-products is a function of the form
$\sum_{i=1}^{p} \left( \prod_{j=1}^{q} g_j \right)$, where $p$ is the number of terms, $q$
is the number of inputs into a term, and $g$ is each input bit.
Significant hardware and propagation delay reductions can be realized when a particular
$n$-input function can be represented in SOP form. The Virtex-II architecture is designed
to efficiently implement wide SOPs. Determining the complexity of SOPs for logic functions
is a difficult problem. Benchmark functions tend to have small SOPs [22].

Figure  31  SOP  implemented  on  Virtex-II.  (From  [18]) 


From analyzing Karnaugh maps [21] [22], the worst case SOP for an $n$-bit
input requires $2^{n-1}$ $n$-input minterms. If the LUTs in Figure 31 are configured to be
4-input LUTs, product minterms can be formed $n$ bits wide, requiring $\lceil n/4 \rceil$
LUTs per product term. Since the number of minterms required for a worst case logic
function is $2^{n-1}$, the entire SOP circuit requires $2^{n-1} \lceil n/4 \rceil$ LUTs, or
$2^{n-2} \lceil n/4 \rceil$ slices. The propagation delay is
$t = t_{net,1} + t_{LUT4} + t_{MUXCY,S\to O} + [\ldots]\, t_{MUXCY,I0\to O} + [\ldots]\, t_{ORCY}$.
See Appendix C.2 for explanation of terms. These equations are used by HUandDelay to
estimate propagation delay and hardware utilization.


Figure 32 HUP and Propagation Delay for n-input LUTs and n-input worst case SOP.


After analyzing the estimations in Figure 32, it is apparent that when the
actual function being realized is not known, it is much more appropriate to use LUTs
instead of SOPs. However, when specific functions are reduced to small SOPs, the
worst-case SOPs are not implemented, and a significant speed-up can occur with a
reduction in hardware utilization. Consider a function that can be reduced to a sum of 4
minterms, where each minterm has 16 inputs (Figure 31). The hardware required
is $4 \times \lceil 16/4 \rceil = 16$ LUTs, or 8 slices. The corresponding HUP is 0.0079%. The
propagation delay is

$t = t_{net} + t_{LUT4} + t_{MUXCY,S \to O} + 3\, t_{MUXCY,I0 \to O} + 4\, t_{ORCY} = 2.924$ ns.

The same function implemented using a 16-bit LUT requires 4 BRAMs, or a HUP = 0.93%, with a
delay of 7.06 ns. Thus, when a specific n-input function is known and can be reduced to
an SOP, the SOP may be much more efficient than an n-input LUT.
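
The LUT and slice counts above can be checked with a short script. The following is a minimal Python sketch, not the author's HUandDelay code: the function names are hypothetical, and the device total of 33,792 slices for the XC2V6000 is an assumption taken from the Xilinx data sheet rather than from this section.

```python
import math

# Assumed device total for the Virtex-II XC2V6000 (from the Xilinx data
# sheet, not from this section). Each slice holds two 4-input LUTs.
XC2V6000_SLICES = 33_792
LUTS_PER_SLICE = 2


def sop_slices(n_inputs: int, n_minterms: int) -> int:
    """Slices for an SOP with n_minterms product terms, each n_inputs wide."""
    luts_per_term = math.ceil(n_inputs / 4)   # one 4-input LUT per 4 input bits
    total_luts = n_minterms * luts_per_term
    return math.ceil(total_luts / LUTS_PER_SLICE)


def worst_case_sop_slices(n_inputs: int) -> int:
    """Worst case uses 2**(n-1) minterms, i.e. 2**(n-2) * ceil(n/4) slices."""
    return sop_slices(n_inputs, 2 ** (n_inputs - 1))


def slice_utilization_percent(slices: int) -> float:
    """Share of the assumed XC2V6000 slice total, in percent."""
    return 100.0 * slices / XC2V6000_SLICES


small = sop_slices(16, 4)                # the 4-minterm, 16-input example
worst = worst_case_sop_slices(16)        # worst-case 16-input SOP
worst_pct = slice_utilization_percent(worst)
```

Under these assumptions the 4-minterm example needs 8 slices, while the worst-case 16-input SOP needs 65,536 slices, which is more than the assumed device provides.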

6.  Address Encoders/Segment Index Encoders (SIEs)


Address encoders are used as Segment Index Encoders (SIEs) in NFGs
with non-uniform segmentation. They determine in which segment an input variable x
lies, and thus determine the memory location of the coefficients used in NFG
calculations. The inputs to the encoder may be all or just some of the bits of the input
variable x. It is much more difficult to estimate hardware utilization and propagation
delay for an SIE, because the size depends on two variables: the number of input bits, n,
and the number of address lines for the coefficients table, k. The SIE is referred to as an n:k
SIE.


a.  Architectures 

The  most  generic  address  encoder  is  shown  in  Figure  33.  SIEs  are  not 
required  for  NFGs  that  use  constant  width  segmentation  because  appropriate  bits  of  x  can 
be  used  as  address  lines  to  the  coefficient  memory  [12].  For  NFGs  with  non-uniform 
segmentation, the number of segments required, $s_{min}$, is determined by segmentation
algorithms. Segmentation algorithms take into account the function being realized by the
NFG, the number of system bits, and the required accuracy of the system. They return
the number of segments $s_{min}$ and the appropriate coefficients to be stored in the NFG’s
coefficient table. The architecture of the Virtex-II requires memory sizes to be a power
of 2 when using BRAMs. Thus a particular NFG should use $s = 2^k$ segments, where k is
the number of address lines to the coefficient memory, and $k = \lceil \log_2 s_{min} \rceil$. A detailed
discussion about segmentation methods can be found in [5].


Figure  33  Generic  Address  Encoder. 

A generic address encoder requires at most k n-input functions, for an n-bit
wide x. For most common NFGs, this generic method would consume an enormous
amount of hardware. The size of an n-bit function is $O(2^n)$, thus a generic address
encoder built in this manner would be $O(2^n \lceil \log_2 s_{min} \rceil)$. Consider an NFG with a 16-bit
input x that requires s = 1024 segments, or k = 10. HUandDelay estimates that each
16-input function uses 2.78% of the BRAMs. This means that the SIE requires 27.8% of the
BRAMs. Now consider an NFG with a 24-bit input and the same number of segments.
The number of BRAMs required per function is 711.1% of the total available BRAMs.
Therefore, 10 functions require 7111% of the BRAMs. In fact, an NFG with 1024
segments cannot be implemented on the Virtex-II XC2V6000 unless x is less than 18 bits
long. Implementing a general address encoder using a SOP structure is impractical as
well, since the worst-case number of required slices is $2^{n-2} \lceil n/4 \rceil$. An SOP for a 16-bit
input, single-bit output function requires 193.9% of the slices on a Virtex-II XC2V6000 FPGA.
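
The BRAM percentages quoted above follow from simple arithmetic, sketched here in Python. This is an illustration only: the function names are hypothetical, the total of 144 BRAMs on the XC2V6000 is an assumption taken from the Xilinx data sheet, and each n-input function is assumed to be stored as a $2^n \times 1$-bit table in BRAMs of 16,384 data bits.

```python
import math

BRAM_DATA_BITS = 16_384   # 2**14 bits per BRAM in 1-bit-wide mode
XC2V6000_BRAMS = 144      # assumed device total (Xilinx data sheet)


def brams_per_function(n_inputs: int) -> int:
    """BRAMs needed to store one n-input, single-output truth table."""
    return max(1, math.ceil(2 ** n_inputs / BRAM_DATA_BITS))


def generic_sie_bram_percent(n_inputs: int, k_outputs: int) -> float:
    """BRAM utilization (percent) of an SIE built from k n-input functions."""
    total = k_outputs * brams_per_function(n_inputs)
    return 100.0 * total / XC2V6000_BRAMS


pct_16 = generic_sie_bram_percent(16, 10)   # 16-bit x, 1024 segments
pct_24 = generic_sie_bram_percent(24, 10)   # 24-bit x, 1024 segments
```

With these assumptions a 17-bit input with k = 10 still fits on the device, while an 18-bit input exceeds 100% of the BRAMs, which agrees with the limit stated in the text.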

Since it is impractical to construct a reasonably large SIE from k n-input
functions or even from a SOP architecture, it is better to estimate general SIEs using LUT
cascades [10][12][14][15]. LUT cascades require $2^{k+1} \times k(n - k)$ memory bits, where
$k = \lceil \log_2 s_{min} \rceil$ and $s_{min}$ is the number of segments. The savings in hardware comes from
the size being $O(n)$, instead of $O(2^n)$ for a general n-input k-output function. The
general architecture of a LUT cascade is shown in Figure 34.

Figure 34  LUT Cascade Architecture. (From: [10][11])


The number of inputs into each LUT in the LUT cascade is k+2, and the
number of rails is equivalent to the number of address lines, k. This architecture
requires $\lceil (n-k)/2 \rceil$ (k+2)-input k-output LUTs [11]. The function HUandDelay calculates
the propagation delay of the LUT cascade by cascading $\lceil (n-k)/2 \rceil$ (k+2)-input LUTs.
Because k LUTs are in parallel, the net delay is adjusted: the fanout of the SIE is
equal to the fanout of each LUT multiplied by k.
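
To make the size difference concrete, the cell count and memory formula for the LUT cascade can be compared with the $k \cdot 2^n$ bits of k independent n-input tables. The Python sketch below uses hypothetical function names and only the formulas quoted in this section; it assumes n > k and does not model slices, BRAM packing, or delay.

```python
import math


def lut_cascade_cells(n: int, k: int) -> int:
    """Number of (k+2)-input, k-output cells: each cell absorbs 2 new inputs."""
    return math.ceil((n - k) / 2)


def lut_cascade_memory_bits(n: int, k: int) -> int:
    """Memory bits of the cascade, using the 2**(k+1) * k * (n - k) formula."""
    return 2 ** (k + 1) * k * (n - k)


def generic_encoder_memory_bits(n: int, k: int) -> int:
    """Memory bits for k independent n-input truth tables."""
    return k * 2 ** n


cells = lut_cascade_cells(16, 10)                 # 16:10 SIE
cascade_bits = lut_cascade_memory_bits(16, 10)
generic_bits = generic_encoder_memory_bits(16, 10)
```

For a 16:10 SIE this gives 3 cells and 122,880 bits for the cascade against 655,360 bits for the generic encoder; the gap widens quickly as n grows because only the generic form is exponential in n.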


Figure 35  HUP and Delay for LUT Cascades vs. k for Various n. (a) HUP; (b) Delay.

b.  Complexity Analysis


The  author’s  function  HUandDelay  returns  hardware  utilization 
parameters  based  on  unknown  functions.  Therefore,  the  best  general  designs  are  used  to 
determine  complexity.  Since  LUT  cascades  require  less  hardware  than  SOPs  and  large 
LUTs,  HUandDelay  uses  the  architecture  described  above  for  LUT  cascades  to  estimate 
the  complexity  of  an  SIE. 

c.  Delay  Analysis 

LUT cascades also exhibit shorter propagation delays for general SIE
functions than the other architectures previously discussed. Therefore, the
propagation delay estimated by HUandDelay is based on that of a LUT cascade.

7.  Block  RAM  (BRAM)  and  Other  Memory 

Memory is utilized within NFGs for storing and retrieving coefficients for the
approximation technique. Using a ROM as described above is the simplest way to get an
n-bit addressable memory, but it may not be the fastest. The Xilinx FPGA includes
18 Kbit BRAM units which can accomplish the same goals with a smaller time delay. For
most NFG applications, writing to memory is not required. Using the BRAMs in
read-only mode can significantly reduce the delay when compared to using LUTs or
distributed RAM. Other circuit designs may utilize external RAMs, but since there is a
wide variety of them, it is not feasible to make estimations for them all. For this reason,
external RAMs are not analyzed in this thesis.

a.  Architecture 

BRAM is included on the Virtex-II and is one of the main resources
discussed throughout this thesis. It provides a relatively large block of memory with fast
connections to surrounding hardware, including the MULT18x18s. The downside of
using the BRAM is that there are a limited number of them (Table 1), and the circuit
adjoining the block must be arranged close to the BRAM in order to minimize the routing
delay. Also, if the desired amount of RAM is less than that contained in one block, then
the rest of the block is wasted. Thus, unless all 18 Kbits of a BRAM are used,
hardware is wasted. Two BRAMs in parallel combined with a 2:1 MUX form a 36 Kbit
RAM. Thus, the number of BRAMs used is $2^n / 2^{14}$ and the number of levels of 2:1 MUXs
is $\log_2 (2^n / 2^{14}) = n - 14$. The overall delay is the sum of the delay from the BRAM plus the
delay of the MUX network required to implement the n-bit address RAM.


Although  each  BRAM  can  have  at  most  14  address  bits,  they  can  be 
configured  to  use  fewer  address  bits.  Using  fewer  address  bits  allows  the  BRAM  to 
contain  more  than  1-bit  per  memory  location.  Table  5  summarizes  the  possible  BRAM 
configurations.  This  thesis  compares  BRAM  usage  for  various  NFG  configurations  using 
1-bit port data width. The BUP is dependent on the number of address bits, n (shown in the
“ADDR Bus” column in Table 5), and the word width, w (“Port Data Width” column in
Table 5). The number of memory bits stored is $s \times w = 2^n \times w$ and is constant, where s is
the number of segments required by the NFG and n is the number of address lines. Thus,
when n is increased, w becomes smaller.


Port Data Width   Depth    ADDR Bus   DI Bus / DO Bus   DIP Bus / DOP Bus
1                 16,384   <13:0>     <0>               NA
2                 8,192    <12:0>     <1:0>             NA
4                 4,096    <11:0>     <3:0>             NA
9                 2,048    <10:0>     <7:0>             <0>
18                1,024    <9:0>      <15:0>            <1:0>
36                512      <8:0>      <31:0>            <3:0>

Table 5  Virtex-II BRAM Configurations for Single-port RAMs. (From [18])

b.  Complexity  Analysis 

Since there are multiple ways to configure the BRAM for various word
widths, HUandDelay determines the number of bits of memory required by the
equation $2^k \times w$. The number of BRAM blocks required is

$\left\lceil \frac{\text{\# of memory bits required}}{\text{\# of memory bits per BRAM}} \right\rceil = \left\lceil \frac{2^k \times w}{16384} \right\rceil$.

The required BRAM blocks are multiplexed together with a
$\lceil 2^k \times w / 16384 \rceil$:1 MUX. HUandDelay calls itself recursively to obtain the
hardware utilization parameters for the MUX. It returns the total hardware utilization
parameters by summing the two. Note that there will be some wasted hardware (MUXs)
if the number of BRAM blocks is not a power of 2, but the BRAMs are not wasted.
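
The block count and the size of the output MUX can be sketched as follows. This Python fragment is only an illustration of the arithmetic described above, with hypothetical function names; it assumes 16,384 usable data bits per BRAM and does not reproduce the recursive HUandDelay call or any utilization percentages.

```python
import math

BRAM_DATA_BITS = 16_384   # data bits per BRAM, as used in the equation above


def bram_blocks(k: int, w: int) -> int:
    """BRAM blocks for a memory with k address lines and w-bit words."""
    return max(1, math.ceil((2 ** k) * w / BRAM_DATA_BITS))


def mux_levels(blocks: int) -> int:
    """Levels of 2:1 MUXs needed to select one of `blocks` BRAM outputs."""
    return (blocks - 1).bit_length()   # ceil(log2(blocks)); 0 for one block


def padded_mux_inputs(blocks: int) -> int:
    """MUX inputs after rounding up to a power of two (some may be unused)."""
    return 2 ** mux_levels(blocks)


blocks_a = bram_blocks(10, 32)   # 1024 words of 32 bits
blocks_b = bram_blocks(16, 1)    # 16 address lines, 1-bit words
blocks_c = bram_blocks(12, 20)   # 4096 words of 20 bits
```

The last case needs 5 blocks, so a MUX tree with 8 inputs would leave 3 inputs unused; this is the wasted MUX hardware mentioned in the text.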


c.  Delay  Analysis 

Analyzing the delay is somewhat more difficult for BRAMs, since they
are actually synchronous circuits and every other circuit studied so far has been
combinational. This thesis looks at combining different arithmetic devices in series to
determine the overall NFG propagation time. It does not take into account the setup
and hold times that a sequential circuit would require. For the purposes of this thesis, the delay
of a BRAM, $t_{prop,BRAM}$, is defined as $t_{prop,BRAM} = t_{NET} + t_{BCKO}$, where the net delay depends
on the fanout, and $t_{BCKO}$ is the delay from the time the clock signal transitions to the time
when the output data bits are valid. In this situation, we assume the address bits to the
BRAM are stable when the clock undergoes a transition. HUandDelay uses the equation
above to compute the propagation delay for the BRAM as a pseudo-combinational delay.
Memories that require more than one BRAM are combined with an appropriately
sized MUX. HUandDelay also accounts for the required MUX delay.


E.  VISUALLY  REPRESENTING  COMPLEXITY  AND  PROPAGATION 
DELAY 


Various  m-files  are  used  to  plot  HU-Delay  Graphs,  which  visually  represent  the 
components.  These  graphs  make  it  easy  to  compare  components  versus  size  and  delay  at 
the  same  time,  and  also  compare  different  components  to  see  which  ones  take  up  more 
space.  The  delay  axis  of  the  HU-Delay  Graph  represents  the  timeline  on  which  the  signal 
propagates through a component, or through multiple components. The HUP (vertical)
axis is the measure of hardware that is utilized for a particular component or components.

The author’s MATLAB functions HUPBoxes.m and boxesOrigin.m both
produce HU-Delay graphs. However, boxesOrigin.m keeps the bottom-left corner of
each component anchored at the origin, while HUPBoxes.m arranges the components
based on their dependency relationships.

1.  Comparing  the  Same  Components  with  Different  Sizes 

The  HU-Delay  Graphs  can  be  helpful  when  comparing  a  specific  component 
versus  size.  Figure  36  compares  adders  at  various  word-widths  using  the  function 
boxesOrigin.m.  For  example,  the  delay  for  a  64-bit  adder  is  approximately  5.8  ns  and  it 
uses  approximately  0.032%  of  the  hardware. 


Figure 36  HU-Delay Graph of Adders with Various Word-widths.

2.  Comparing  Arithmetic  Components  with  the  Same  Number  of  Input 
Bits 

Figure  37  shows  several  different  components  with  the  same  word  width.  Notice 
that  a  ROM  built  from  CLBs  with  18  address  lines  takes  up  the  most  space  and  has  the 
longest  delay,  whereas  the  18-bit  Barrel  Shifter  takes  the  least  time  and  least  hardware. 
This  type  of  comparison  is  useful  when  comparing  two  candidate  components  for  a 
particular NFG. For example, consider possible NFG architectures for
implementing $f(x) = x^2$. One could use an 18-bit by 18-bit unsigned multiplier, while
another could simply use BRAM with a total of 18 address lines. The HU-Delay graph in
Figure 37 shows the comparison between the two. Notice that there are tradeoffs to
consider. Using the BRAM is faster, but using the multiplier requires less hardware.


Figure 37  HU-Delay Graph of Several 18-bit Components.

3.  Multiple  Components  in  Series 

Generally,  NFGs  contain  multiple  cascaded  components.  Linear  NFGs  provide  a 
good  example  where  the  components  are  in  series,  that  is,  each  component  must  wait 
until  the  previous  component  has  completed  its  computation  prior  to  initiating  its  own 
computation.  Figure  38  shows  an  example  of  a  linear  NFG  with  non-uniform 
segmentation using the function HUPBoxes.m. The bottom-left corner of each
component is anchored on the delay axis at the end of the delay of the previous
component. In the example, the adder must wait until the barrel shift operation is
complete; the barrel shifter must wait until the multiplier is finished; and so on. Notice
that the hardware utilization for each component can be read off the HUP axis from the
top of each respective box. For example, the multiplier takes up roughly 0.95% of the
FPGA hardware and the SIE takes up roughly 0.7%. The delay for each component is the
width of its associated box. Thus, the SIE takes roughly 12 ns to complete, while the
BRAM takes 3 to 4 ns. The HU-Delay graph easily shows relative hardware utilization
and delays for all of its components simultaneously.


Figure 38  HU-Delay Graph of Various Components in Series.

4.  Multiple  Devices  in  Parallel 

In  some  NFGs,  calculations  can  be  done  in  more  than  one  arithmetic  component 
at  the  same  time.  The  example  in  Figure  39  shows  the  exact  same  components  that  are  in 
Figure  38,  but  they  are  arranged  in  a  parallel  configuration.  This  view  allows  easy 
detection  of  which  component  takes  the  longest  time  to  propagate.  It  also  makes  it  easy 
to  see  the  total  hardware  utilization  for  the  NFG. 

Figure 39  HU-Delay Graph of Various Components in Parallel.

5.  Multiple Devices in Series/Parallel Configurations


The previous component configurations demonstrate relatively simple NFG
architectures, but efficiently designed NFGs require multiple arithmetic components in a
series/parallel combination. Creating HU-Delay graphs for more complex NFGs is not
as simple as for the previously mentioned configurations. In order to combine multiple
components, it is necessary to know which components depend on the results from other
components.

The “dependency” matrix D is a square matrix that contains the dependency
relationships for all of the components in a particular NFG. Each row corresponds to a
particular component in the NFG. For a given NFG, let k be the number of components
in the NFG. Thus D is a k × k matrix. Let p represent the index into the list of
component names, where $1 \le p \le k$. A particular component p depends on another
component n iff $D_{p,n} \neq 0$. Figure 40 shows an example of a simple NFG where device 2
depends on device 1, and device 3 depends on device 2. The function HUPBoxes.m
uses the dependency matrix to arrange components in series and/or parallel. If a
particular component cannot begin until another component has completed its computation,
then it is said to “depend” on that component. This is particularly useful when
constructing NFGs where the multipliers require an output from the memory before they can
begin their computation. Thus an overall delay can be assessed even when components operate in
parallel. Since components can depend on more than one other component, HUPBoxes
places the component in series with the component which finishes the latest, thereby
computing the longest path delay. HUPBoxes disregards data in the upper-right
triangle of D to prevent circular dependencies.
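
The placement rule described above (start each component when the latest of its prerequisites finishes, reading only the lower triangle of D) can be sketched in a few lines. This Python fragment is not the author's HUPBoxes.m; the function name, the component list, and the delay values are hypothetical and chosen only to illustrate the longest-path computation.

```python
def schedule(delays, D):
    """Return (start_times, total_delay) for a lower-triangular dependency matrix.

    delays[p] is the delay of component p; D[p][d] != 0 means that
    component p depends on component d. Only entries with d < p are read,
    so circular dependencies cannot occur.
    """
    count = len(delays)
    start = [0.0] * count
    for p in range(count):
        finish_times = [start[d] + delays[d] for d in range(p) if D[p][d] != 0]
        # A component starts when the latest of its prerequisites finishes.
        start[p] = max(finish_times, default=0.0)
    total = max(start[p] + delays[p] for p in range(count))
    return start, total


# Hypothetical example: SIE, BRAM, multiplier, barrel shifter, adder.
names = ["SIE", "BRAM", "MULT", "SHIFT", "ADD"]
delays = [12.0, 3.5, 6.0, 2.0, 4.0]   # illustrative delays in ns
D = [
    [0, 0, 0, 0, 0],   # SIE depends on nothing
    [1, 0, 0, 0, 0],   # BRAM depends on SIE
    [1, 0, 0, 0, 0],   # MULT depends on SIE
    [0, 1, 1, 0, 0],   # SHIFT depends on BRAM and MULT
    [0, 0, 0, 1, 0],   # ADD depends on SHIFT
]
start, total = schedule(delays, D)
```

In this made-up example the barrel shifter starts at 18 ns, after the slower multiplier, and the total delay along the longest path is 24 ns.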

Figure 40  Example of a Dependency Matrix D.


More complex NFGs are shown in Figure 41a and Figure 41b.
Figure 41a shows an NFG whose multiplier and BRAM both depend on the SIE. The
barrel shifter depends on both the multiplier and the BRAM, and therefore must wait
until both of them have completed their computation. Since the multiplier takes longer,
the barrel shifter starts after the multiplier is done. In the example in Figure 41b, the
barrel shifter depends on the BRAM and not the multiplier, thus it can operate in parallel
with the multiplier. Also notice that the adder must wait on the multiplier.


Figure 41  HU-Delay Graph of Series-Parallel Composite Device. (a) Barrel Shifter
Depends on Multiplier; (b) Barrel Shifter Depends on BRAM.


F.  CHAPTER  SUMMARY 


This chapter shows how various arithmetic and logic components (such as
multipliers and coefficient tables) can be built from the resources on the Virtex-II FPGA
(CLBs, BRAMs, and MULT18x18s). It defines terminology for measuring the usage of
each resource to be used in comparing components and NFGs. This chapter also shows
how simulation results are collected and how fillLin is used to estimate missing data
points. This allows relatively accurate complexity and delay estimations for components
that were not simulated. The hardware utilization and delay estimations for the
components computed by HUandDelay are validated in this chapter. The following
chapter organizes several components into specific NFG models, using the complexity
and delay estimations for each component to produce complexity and delay estimations
for each entire NFG. Not all of the components in this chapter are used in the models in
Chapter IV. For example, MUXs and barrel shifters are not used. They were analyzed in
anticipation of alternative NFG models. Future work might explore the benefits of using
barrel shifters instead of multipliers in NFGs. The next chapter describes eight NFG
models that are commonly described in other resources [8][11][12].

IV.  CONSTRUCTING MODELS FOR CURRENT NFG ARCHITECTURES


This  chapter  outlines  how  models  are  constructed  to  accurately  represent 
particular  NFGs.  The  models  below  are  simple  examples  of  what  can  be  constructed 
from the basic components listed in Table 4. The term “component” is used throughout
this  thesis  to  refer  to  a  basic  arithmetic  device  that  is  used  within  an  NFG.  For  example, 
the  components  of  an  LUB  NFG  are  a  ROM,  a  multiplier,  and  an  adder.  The  models  use 
simple  assumptions  and  estimates  to  reduce  the  number  of  variables  determining  the 
complexity  and  delay  of  a  particular  NFG. 

A.  NFG  MODEL  CONSTRUCTION  AND  USAGE 

The models in this chapter produce HUP and delay estimations based on two
known variables: the system size, n, and the number of required segments, s. The model
input variable n determines the width of the arithmetic components and contributes to
estimating the SIE if it is required. The input variable s, along with the size of the word
stored in memory, w, determines the size of the required memory. It also contributes to
determining the size of the SIE (if required). Each model defines w based on the
architecture and n.

This allows any particular model to be independent of a particular function.
Generally, s depends on n, but the models do not calculate a value for s. Each model is
based only on the particular NFG architecture and the required memory size. The
architecture provides the type and quantity of arithmetic components required, the sizes
of each component, and the dependency relationship between the components. For each
component in the NFG, the models (i.e. model_*.m) call the author’s function
HUandDelay.m to retrieve the SUP, MUP, BUP, and delay. For example, if an NFG
architecture requires a 17-bit adder, then the model_* file calls HUandDelay, which
returns the parameters for a 17-bit adder. Each model then assembles a matrix C
containing the HUPs and delays for all of its components. A corresponding dependency
matrix D is also constructed within each model based on the dependency relationship
characterized by the architecture. These matrices are passed together with a list of
component names into either HUPBoxes.m or totalHUPandDelay.m. Both functions
return the total HUP and delay along the worst case delay path. The only difference
between the two functions is that the latter does not produce an HU-Delay Graph.

A feature of the MATLAB code file architecture is that any hardware
configuration can be implemented as long as it uses the basic components in
HUandDelay.m. Any of the models can realize any function, as long as the number of
segments is known. In fact, for the same architecture, the only difference between an
NFG realizing f(x) and one realizing g(x) is the set of coefficients stored in memory.
The number of coefficients is proportional to the number of segments, which depends on
the properties and domain of the function being realized by the NFG. Therefore, the size
of the memory and SIE (if required) depend on the function realized on the NFG. But
again, the only inputs into HUandDelay.m are s and n.

B.  ESTIMATING  THE  APPROPRIATE  SIZE  FOR  COMPONENTS 

To  make  accurate  size  and  delay  estimations  for  NFGs,  it  is  imperative  that  the 
estimates  for  its  components  be  accurate  as  well.  This  section  describes  the  assumptions 
made  in  order  to  produce  a  few  common  NFG  architectures. 

1.  Estimating  the  Memory  and  SIE  Sizes 

Memory and SIE sizes are based on the number of segments, s, required. The
number of segments depends on the function, the function interval, the type of NFG
(linear, quadratic, or other) and the precision of the number system. The function
segments.m calculates the number of segments required when given a function,
interval, and number system size. It assumes the allowable error is $\varepsilon = 2^{-(n+1)}$. Higher
accuracies require more segments.

The number of segments has been determined for several functions and for a few
commonly-used precisions [4][8][11][13][20]. Some of the data has been collected from
experimental data and some has been calculated with asymptotic approximations.
Relevant data is combined in Table 6. The data can be useful, but it only provides data
for three values of s. It does not provide a general formula for various architecture sizes.
Table 6 shows the number of segments required for linear uniform (LU), linear non-uniform
(LN), quadratic uniform (QU), and quadratic non-uniform (QN) NFGs of various n-bit
systems. Here, “uniform” and “non-uniform” refer to the segmentation type.


# 
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5 

log2(x) 

[1.2] 

109 
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13 
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64 

44 

27833 

19097 

506 

351 
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91 
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[0,2] 

Notel 

449 
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54 
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5103 
Note 1. Data not available for these NFGs.

Table 6  Function Suite Including the Number of Segments for LN, LU, QU, and QN
NFGs. (After [20][4])

The minimum segment width for a linear NFG is
$\sigma_{min} = 4 \sqrt{\varepsilon / |f^{(2)}(x^*)|}$, where $x^*$ is
the value at which $|f^{(2)}(x)|$ is maximum [5]. For a quadratic NFG,
$\sigma_{min} = 4 \sqrt[3]{3\varepsilon / |f^{(3)}(x^*)|}$, where $x^*$ is the value at which
$|f^{(3)}(x)|$ is maximum.
Thus for NFGs with uniform segmentation, the number of segments can be determined
by dividing the domain of the NFG by the smallest segment width. Therefore
$s_{min} = \frac{b - a}{\sigma_{min}}$, where [a,b] is the domain of the NFG. For NFGs with non-uniform
segmentation, it is more complicated to determine the number of segments. The number
of segments for a linear NFG with non-uniform segmentation is

$s_{min}^{LN} = s(\varepsilon) \approx \frac{1}{4\sqrt{\varepsilon}} \int_a^b \sqrt{|f^{(2)}(x)|}\, dx$.

This is derived in [4]. We also consider the number
of segments for a quadratic NFG with non-uniform segmentation to be given by the
analogous equation

$s_{min}^{QN} = s(\varepsilon) \approx \frac{1}{4\sqrt[3]{3\varepsilon}} \int_a^b \sqrt[3]{|f^{(3)}(x)|}\, dx$

(after [4]). This has not yet been
proven, but is shown to be accurate by correlating it with experimental segmentation
methods.


The author’s m-file segments.m uses MATLAB’s symbolic toolbox to calculate
the derivatives above. It then substitutes values for the interval [a,b] and the number of
bits, n, to calculate the maximum segment width $\sigma_{max}$. Since MATLAB’s symbolic
toolbox cannot compute the exact integrals for some of the more complicated functions
(especially those using the absolute value), a numerical integration using a trapezoidal
approximation [24] is implemented. The numerical integral approximation was
compared to the symbolic integrations (for those functions that could be integrated
symbolically), yielding the same results. The number of segments calculated with these
equations matches most of the data in Table 6, supporting the assumption that they
accurately calculate the number of required segments for
a QN. Table 7 shows the results of the calculations. The values that do not exactly
match those in Table 6 are noted below Table 7. Although there are
small differences, they are all relatively accurate. Also, since $k = \lceil \log_2 s_{min} \rceil$,
where k is an integer, the actual number of segments being implemented is rounded up to the
nearest power of two.
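
The four segment-count estimates can be illustrated with a small numerical sketch. The Python code below is not the author's segments.m: it takes the second and third derivatives as caller-supplied functions instead of deriving them symbolically, finds the maximum derivative magnitude by sampling, and uses a plain composite trapezoidal rule. All function names are hypothetical, and the allowable error is assumed to be $2^{-(n+1)}$.

```python
import math


def trapezoid(g, a, b, steps=20_000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h


def segment_estimates(d2, d3, a, b, n_bits, samples=20_000):
    """Unrounded segment counts for LU, LN, QU and QN NFGs on [a, b]."""
    eps = 2.0 ** -(n_bits + 1)                       # assumed allowable error
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    max_d2 = max(abs(d2(x)) for x in xs)             # sampled |f''(x*)|
    max_d3 = max(abs(d3(x)) for x in xs)             # sampled |f'''(x*)|
    cube_root = 1.0 / 3.0
    return {
        "LU": (b - a) * math.sqrt(max_d2 / eps) / 4.0,
        "LN": trapezoid(lambda x: math.sqrt(abs(d2(x))), a, b)
              / (4.0 * math.sqrt(eps)),
        "QU": (b - a) * (max_d3 / (3.0 * eps)) ** cube_root / 4.0,
        "QN": trapezoid(lambda x: abs(d3(x)) ** cube_root, a, b)
              / (4.0 * (3.0 * eps) ** cube_root),
    }


def address_lines(s_min):
    """k = ceil(log2(s_min)); zero when at most one segment is needed."""
    if s_min <= 1:
        return 0
    return (math.ceil(s_min) - 1).bit_length()


# Example: f(x) = log2(x) on [1, 2] for a 32-bit system.
LN2 = math.log(2.0)
est = segment_estimates(
    d2=lambda x: -1.0 / (x * x * LN2),
    d3=lambda x: 2.0 / (x ** 3 * LN2),
    a=1.0, b=2.0, n_bits=32,
)
k = {name: address_lines(value) for name, value in est.items()}
```

For this example the sketch gives roughly 27,800 (LU), 19,300 (LN), 505 (QU), and 350 (QN) segments before rounding, so the linear NFGs would need 15 address lines and the quadratic NFGs 9.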

Numerical integration allows us to integrate any function as long as it is
continuous and bounded over the given interval. Since it is used to calculate the integral
of a 2nd or 3rd order derivative, we must ensure that the original function being
implemented on the NFG is twice or thrice differentiable, for linear or quadratic NFGs,
respectively. This makes sense, since if f(x) is a linear function, then it is implemented
exactly with a linear NFG. Its 2nd derivative is 0, the integral of which is also 0. This
yields a segment width of $\infty$, and 0 segments. The function segments.m allows any
function input in the form of a string (for example, ‘exp(x)’). The function must be
recognized by the functions in MATLAB and must be a single-variable function of x.
The domain [a,b] for the NFG is also input to yield the number of required segments, $s_{min}$.

The function segments.m estimates the number of segments $s_{min}$ for LU, LN,
QU, and QN NFGs in a single function call. From this, each model determines the
number of address lines associated with its required coefficients table, $k = \lceil \log_2 s_{min} \rceil$.
These are needed to determine the size of the memory and the size of the SIE for the
NFG that realizes the specific function over a specific interval [a,b]. The HUP and delay
of the most compact memory unit is returned by calling HUandDelay(k, 'Mem', w), where
w is the width of the word stored at each memory location. Each model with non-uniform
segmentation also requires an n:k SIE.

1. Slightly different from Table 6, but there is no difference in implemented memory sizes.

2. Different from Table 6, resulting in an additional address line to the implemented memory.

3. New results.

Table 7  Number of Segments Based on Proven [5] and Assumed Equations.

2.  Estimating  Multiplier  Size 

The goal of this thesis is to estimate general NFG complexity and delay without
having to perform a lengthy synthesis. The multipliers analyzed in HUandDelay are n-bit
by n-bit multipliers whose product is 2n bits in length. Some NFG designs may
require n-bit by m-bit multipliers, where $m \neq n$. To save all data bits, the product must
contain n+m bits. In these cases, $\frac{n+m}{2}$-bit multipliers are used because their
complexities are slightly more than those of multipliers optimized for the specific n and m
values. This provides a worst-case estimate for a multiplier. Multiplier complexity can
also be reduced by neglecting some of the output bits. For example, some NFG designs
may simply require an n-bit multiplier with an n-bit product. Again, a full n-bit
multiplier is substituted, representing a worst-case multiplier size.
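
The substitution rule above can be sketched in a few lines of Python. This is only an
illustration and not part of HUandDelay: the function name surrogate_multiplier_bits is
hypothetical, and rounding (n+m)/2 up when n+m is odd is an assumption.

```python
def surrogate_multiplier_bits(n: int, m: int) -> int:
    """Width of the square multiplier that stands in for an n-bit by m-bit one.

    Assumption: when n + m is odd, the width is rounded up so that the
    estimate stays on the worst-case side.
    """
    return (n + m + 1) // 2


if __name__ == "__main__":
    # A 16-bit coefficient times a 32-bit square term is modeled as 24-bit by 24-bit.
    print(surrogate_multiplier_bits(16, 32))
    # A k-bit by (n-k)-bit multiplier with n = 16, k = 4 is modeled as 8-bit by 8-bit.
    print(surrogate_multiplier_bits(4, 12))
```

With this rule, an n-bit by 2n-bit multiplier is modeled as a 1.5n-bit multiplier, and a
k-bit by (n-k)-bit multiplier as an (n/2)-bit multiplier.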

3.  Estimating  Adder  Size 

The adders analyzed in this thesis have two n-bit inputs and produce an n-bit
sum. However, quadratic NFGs often require multiple-input adders. Since
HUandDelay does not provide information on multiple-input adders, the models in this
thesis use adders in series. Also, when two inputs are different sizes, the adder uses the
larger of the two sizes. Figure 42 shows an example of a 3-input adder with (m+1)-bit
and (n+1)-bit inputs, where m > n.

Figure  42  Using  Two  2-Input  Adders  to  Realize  a  3-input  Adder. 

4.  Estimating Other Components Not Analyzed by HUandDelay

NFGs may require additional arithmetic components that are not analyzed by
HUandDelay.m. For functions with few inputs (n = 1 to 7 bits), LUTs can be used to
realize a general function. This may be applicable to NFGs that incorporate special
number handling or signed number manipulation. It might also be efficient to use an SOP
implementation. The models in this thesis do not require special hardware.


C.  MODELS FOR COMMON NFG ARCHITECTURES


The models described in this section are summarized in Table 8. They have been
developed from architectures in [8][11][12]. Appendix A.1 shows how to use the models
to obtain desired data and plot HU-Delay Graphs.

1.  Basic  Linear  NFGs 

Basic linear NFGs approximate f(x) with s equations in the form
$y_i(x) = c_{1i}x + c_{0i}$, where $i \in \mathbb{Z}$ and $1 \le i \le s$. The constants $c_{1i}$ and $c_{0i}$ are
stored in memory or in LUTs. The sizes of the components in the basic NFG architectures
are the minimum required sizes such that no bits are truncated or rounded. For example, a
multiplier with two n-bit inputs produces a product that has 2n bits. The architectures are
shown in Figure 43. The HU-Delay graphs in Figure 44 show examples of basic linear
NFGs realizing $f(x) = \sqrt{x}$ on [1,2].


Figure 43  Basic Linear NFG Architectures: (a) Uniform Segmentation, (b) Non-Uniform
Segmentation. (After [12])
Figure 44  HU-Delay Graphs for LUB and LNB NFGs Realizing $f(x) = \sqrt{x}$ on the
Interval [1,2] with n = 16. (a) LUB: Total HUP = 0.311%, Total Propagation Delay = 17.39 ns.
(b) LNB: Total HUP = 0.3643%, Total Propagation Delay = 21.67 ns.

a.  Uniform  Segmentation 

The architecture for a basic linear NFG with uniform segmentation (LUB)
is shown in Figure 43a. It requires a $2^k \times w$ memory, an n-bit multiplier, and a 2n-bit
adder. This architecture requires two coefficients to be stored in memory for each
segment. Thus, w = 2n. The number of segments is determined by segments.m,
and the number of address lines required for the coefficients table is $k = \lceil \log_2 s \rceil$. The
multiplier requires a coefficient $c_{1i}$ from the memory. Thus, computing $c_{1i}x$ can only
occur after a memory read has been completed. Likewise, the adder must wait until the
multiplier has completed its computation. Thus, the adder depends on the multiplier.
This dependency is shown in the dependency matrix in Figure 45.
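
The way a dependency matrix turns individual component delays into a total propagation
delay can be sketched as follows. This is a minimal Python illustration, not the author's
MATLAB model: the function name and the delay values are hypothetical, and it assumes
that entry (i, j) = 1 means component i must wait for component j and that components
are listed so that each one depends only on earlier ones.

```python
def critical_path_delay(delays, depends_on):
    """Return the longest accumulated delay through the component graph.

    delays[i]        : propagation delay of component i (ns)
    depends_on[i][j] : 1 if component i must wait for component j (assumed convention)
    """
    finish = []
    for i, delay in enumerate(delays):
        # A component starts when the slowest of its prerequisites has finished.
        start = max((finish[j] for j in range(i) if depends_on[i][j]), default=0.0)
        finish.append(start + delay)
    return max(finish)


if __name__ == "__main__":
    # Hypothetical LUB delays: memory, multiplier, adder.
    lub_delays = [5.0, 8.0, 4.0]
    lub_deps = [[0, 0, 0],
                [1, 0, 0],   # multiplier waits for memory
                [0, 1, 0]]   # adder waits for multiplier
    print(critical_path_delay(lub_delays, lub_deps))

    # Hypothetical LNB delays: SIE, memory, multiplier, adder.
    lnb_delays = [4.0, 5.0, 8.0, 4.0]
    lnb_deps = [[0, 0, 0, 0],
                [1, 0, 0, 0],   # memory waits for the SIE
                [0, 1, 0, 0],
                [0, 0, 1, 0]]
    print(critical_path_delay(lnb_delays, lnb_deps))
```

For a simple chain such as the LUB, the result is just the sum of the delays. The
longest-path form matters for the quadratic models, where several multipliers and adders
depend on the memory in parallel.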

b.  Non-uniform  Segmentation 

The basic linear NFG with non-uniform segmentation is referred to as the
LNB. The only difference between the architectures with non-uniform and uniform
segmentation is that the non-uniform architecture also requires an n:k SIE. The memory
must wait for the SIE to complete its address computation before the memory can begin
to look up the coefficients. This dependency is also shown in Figure 45. In general,
non-uniform architectures require fewer segments. Thus, k is smaller than that of a similar
architecture with uniform segmentation.
Figure 45  Dependency Matrices for Basic Linear NFGs: (a) Uniform Segmentation,
(b) Non-uniform Segmentation.


To implement a specific function with a basic linear NFG, the user must
call the function model_Linear_Uniform_Basic or the function
model_Linear_NonUniform_Basic with the size of the number system (n) and the
number of segments ($s_{\min}$). The author's MATLAB m-file segments.m returns the
number of segments required based on the proofs in [4] and a system error $\varepsilon = 2^{-(n+1)}$.


2.  Compact  Linear  NFGs 


Compact linear NFG architectures are shown in Figure 46 for both uniform and
non-uniform segmentation. HU-Delay graphs are shown in Figure 47 for $f(x) = \sqrt{x}$ on
[1,2]. They compute the function $y = c_{1i}(x - s_i) + f(s_i) + v_i$. These types of NFGs can be
used to reduce the size of the arithmetic components. This often reduces the delay and
sometimes the hardware utilization for the NFG. They do not always reduce the overall
amount of hardware required. However, compare the architecture of the NFG in Figure
46a with the basic linear NFG in Figure 43a. The multiplier in the compact NFG is a
k-bit by (n-k)-bit multiplier, resulting in an n-bit product. This thesis approximates this
type of multiplier with a $\frac{n}{2}$-bit by $\frac{n}{2}$-bit multiplier, which is smaller than
the n-bit by n-bit multiplier used in the basic linear NFG above. Also, the memory
only has to store an (n+k)-bit word for each segment instead of a 2n-bit word. For the
architecture in Figure 46b, additional hardware is required when compared to the basic linear
NFG in Figure 43b: an n-bit adder and an additional coefficient in memory. Therefore,
there is a trade-off to be considered. The adder causes a relatively small delay and takes
up very little hardware. However, if the number of segments is large, then adding an
additional n-bit word for each segment can become extremely costly in terms of hardware
utilization.


In addition, the architectures below must be analyzed carefully for each particular
function before determining which bits may be truncated without loss of precision. Thus,
q cannot be determined as a generality even though some specific architectures have been
analyzed in depth [13]. To show general comparisons, the compact models in this thesis
use $q = \frac{n}{2}$.


Figure 46  Compact Linear NFG Architectures: (a) Uniform Segmentation, (b) Non-Uniform
Segmentation. (After [11])

Figure 47  HU-Delay Graphs for LUC and LNC Realizing $f(x) = \sqrt{x}$ on the Interval
[1,2] with n = 16: (a) LUC, (b) LNC.
Models for the LUC and LNC return HUP and delay by calling
model_Linear_Uniform_Compact and model_Linear_NonUniform_Compact,
respectively. The components and dependency matrices for compact
linear NFGs using uniform and non-uniform segmentation methods (LUC and LNC) are
summarized in Table 8.

3.  Basic  Quadratic  NFGs 

Basic quadratic NFGs approximate f(x) with s equations in the form
$y = c_{2i}x^2 + c_{1i}x + c_{0i}$, where $i \in \mathbb{Z}$ and $1 \le i \le s$. The constants $c_{2i}$, $c_{1i}$, and $c_{0i}$
are stored in memory or in LUTs. Like the basic linear NFGs, the sizes of the components
in the basic quadratic architectures are the minimum required sizes such that no bits are
truncated or rounded.

Basic quadratic architectures are shown in Figure 48 for NFGs using uniform and
non-uniform segmentation. Each requires three multipliers, two adders, and a
coefficients table that contains three n-bit words. The NFG with non-uniform
segmentation also requires an n:k SIE. An n-bit multiplier is used to produce $x^2$, which is
a 2n-bit product. To prevent truncation of any bits, a total of two n-bit multipliers and a
single 1.5n-bit multiplier are used. In addition, the first adder requires a 2n-bit input
($c_{1i}x$) and an n-bit input ($c_{0i}$). Thus, a 2n-bit adder is used. The 2n-bit sum
($c_{1i}x + c_{0i}$) is added to the 3n-bit product $c_{2i}x^2$ in a 3n-bit adder.
Figure 48  Basic Quadratic NFG Architectures: (a) Uniform Segmentation, (b) Non-Uniform
Segmentation. (After [8])


Models for the QUB and QNB return HUP and delay by calling
model_Quad_Uniform_Basic or model_Quad_NonUniform_Basic. A summary of
the components and dependency matrices for QUB and QNB is shown in Table 8. The
HU-Delay graphs for QUB and QNB NFGs realizing $f(x) = \sqrt{x}$ on [1,2] are shown in
Figure 49.
Figure 49  HU-Delay Graphs for QUB and QNB NFGs Realizing $f(x) = \sqrt{x}$ on the
Interval [1,2] with n = 16. (a) QUB: Total HUP = 1.493%, Total Propagation Delay = 32.38 ns.
(b) QNB: Total HUP = 1.514%, Total Propagation Delay = 39.64 ns. Legend: SIE, Coeff. Table,
Multiplier 1, Multiplier 2, Multiplier 3, Adder 1, Adder 2.
4.  Compact Quadratic NFGs

The models for compact quadratic NFGs (model_Quad_Uniform_Compact
and model_Quad_NonUniform_Compact in Appendix A.2) use the basic components
that are necessary to compute $y = c_{2i}(x - s_i)^2 + c_{1i}(x - s_i) + f(s_i) + v_i$, for uniform and
non-uniform segmentations, respectively. Like compact linear NFGs, compact quadratic
NFGs use scaling methods [7] to reduce the size of the multipliers. It is much more
complex to determine the sizes of the components because they also depend on the
required accuracy of the NFG. Larger multipliers can provide more precise results
because fewer bits are truncated. The bit widths illustrated in Figure 50 are only an
example. The sizes cannot be generalized because they depend on the system accuracies
and the effects of truncating bits with respect to a particular function. Thus, they are not
analyzed in the thesis, although the model can be easily modified to apply to a particular
architecture with known component sizes. In-depth analyses have been done in [13] for
exactly rounded quadratic NFGs. The models implemented in this thesis set
$q_1 = q_2 = \frac{n}{2}$ for general comparisons.


Figure 50  Compact Quadratic NFGs: (a) Uniform Segmentation, (b) Non-Uniform
Segmentation. (After [8])
A summary of the components and dependency matrices for QUC and QNC is
shown in Table 8. The HU-Delay graphs for these two architectures are shown in Figure
51.

Figure 51  HU-Delay Graphs for QUC and QNC NFGs Realizing $f(x) = \sqrt{x}$ on the
Interval [1,2] with n = 16. (a) QUC: Total HUP = 0.7434%, Total Propagation Delay = 21.91 ns.
(b) QNC: Total HUP = 0.7641%, Total Propagation Delay = 37.21 ns. Legend: Coeff. Table,
Adder 1, Multiplier 1, Multiplier 2, Multiplier 3, Adder 2, Adder 3.
LUC   $y = c_{1i}(x - s_i) + f(s_i) + v_i$   Figure 46a

LNC   Figure 46b

QUC   $y = c_{2i}(x - s_i)^2 + \ldots$   Figure 50a
      Components: $2^k \times (4n - q_1 - q_2)$ Memory; n-bit Adder; $(q_2/2)$-bit Mult 18x18;
      (n/2)-bit Mult 18x18; n-bit Adder; n-bit Adder

QNC   $y = c_{2i}(x - s_i)^2 + \ldots$   Figure 50b
      Components: n:k SIE; $2^k \times (4n - q_1 - q_2)$ Memory; n-bit Adder; $(q_2/2)$-bit Mult 18x18;
      (n/2)-bit Mult 18x18; (n/2)-bit Mult 18x18; n-bit Adder; n-bit Adder
      Dependency matrix:
      0 0 0 0 0 0 0 0
      1 0 0 0 0 0 0 0
      0 1 0 0 0 0 0 0
      0 0 1 0 0 0 0 0
      0 1 1 0 0 0 0 0
      0 1 0 1 0 0 0 0
      0 1 0 0 1 0 0 0
      0 0 0 0 0 1 1 0

Table 8  NFG Model Components and Dependencies.
D.  CHAPTER SUMMARY

This chapter shows how components are organized to form various models that
represent particular NFG architectures. It shows the assumptions made for choosing the
size of each component within each model. This chapter uses the complexity and delay
estimations from Chapter IV to estimate the complexity and delay for each NFG
model. Future models can be constructed in a similar manner with components sized
specifically for particular NFGs. The models constructed in this chapter are compared in
the following chapter to determine the best segmentation and approximation methods for
particular functions. The next chapter analyzes complexity and delay trends for eight
NFG architectures and 15 functions over a wide range of NFG sizes.
V.  COMPARING COMMON NFG ARCHITECTURES


This chapter compares the basic and compact NFG models to determine the best
configuration for each model for each size. The first function in Table 7 ($f(x) = 2^x$) is
used as an example in this section, but Appendix D contains the same plots for all of the
functions in the function suite. Figure 52 shows HUP and delay versus n for the four
basic NFG architectures realizing the function $f(x) = 2^x$ on the interval [0,1].


Figure 52  Basic Architecture Comparison for NFGs Realizing $f(x) = 2^x$ on the Interval
[0,1]: (a) HUP, (b) Delay.


A.  COMPARING  UNIFORM  VERSUS  NON-UNIFORM  SEGMENTATION 

The benefits of using non-uniform segmentation can be seen in Table 7 by the
reduction in the number of required segments. This results in a smaller memory size than
the same NFG using uniform segmentation. However, the main reason that NFGs with
uniform segmentation require less hardware than NFGs with non-uniform segmentation is
the SIE. It can be seen in Figure 53 that even for small NFGs, the SIE can consume more
resources and take longer than all of the other NFG components combined. As n gets
larger, the portion of the HUP and delay that is due to the SIE grows.
Total HUP = 0.3465%, Total Propagation Delay = 27.87 ns.

Figure 53  HU-Delay Graphs for $f(x) = 2^x$ for n = 12 and n = 16 bits: (a) n = 12 bits,
(b) n = 16 bits. Legend: Memory, Multiplier, Adder.


1.  Comparing  Hardware 

Figure 52 clearly shows that for $f(x) = 2^x$, $HUP_{LUB} < HUP_{LNB}$ and
$HUP_{QUB} < HUP_{QNB}$ for all n. Also, $t_{LUB} < t_{LNB}$ and $t_{QUB} < t_{QNB}$ for all n. The savings in
memory from using non-uniform segmentation are generally counteracted by the size and
delay of the SIE that is required. Thus, in almost all cases it is better to use uniform
segmentation. 13 of the 15 functions in the function suite yield this result (Appendix D).
The functions that do not behave the same are function 10 ($f(x) = \sqrt{-\ln x}$) and function 12
($f(x) = (x-1)\log_2(1-x) - x\log_2 x$). Figure 54 shows that for function 10, non-uniform
segmentation using an SIE requires less hardware than uniform segmentation for both
linear and quadratic NFGs. It also shows that for function 12, non-uniform segmentation
requires less hardware only in quadratic NFGs.
Figure 54  Cases Where Non-uniform Segmentation Requires Less Hardware than
Uniform Segmentation: (a) Basic NFGs realizing f(x) = sqrt(-log(x)) on the interval
[0.0019531, 0.25], (b) Basic NFGs realizing f(x) = 0-(x*log2(x)+(1-x)*log2(1-x)) on the
interval [0.0039063, 0.99609].


The main factor is the number of segments, which is mostly affected by the
function properties. Part of Table 7 is shown in Table 9. For function 1 ($f(x) = 2^x$),
$s_{LN} \approx s_{LU} \times 84\%$ and $s_{QN} \approx s_{QU} \times 89\%$. Compare these memory savings to those for
function 10, where $s_{LN} \approx s_{LU} \times 4.2\%$ and $s_{QN} \approx s_{QU} \times 4.1\%$. Here, non-uniform
segmentation reduces the required number of segments, s, so drastically that the
combined hardware for the SIE and memory for a non-uniform NFG is less than that of
the memory required for a uniform NFG. This explains why for both linear and quadratic
NFGs, non-uniform segmentation requires less hardware (Figure 54a). For function 12,
$s_{LN} \approx s_{LU} \times 18.2\%$ and $s_{QN} \approx s_{QU} \times 9\%$. Notice that the savings in memory is less for
linear NFGs than it is for quadratic NFGs. The graph in Figure 54b shows this as well.
In fact, for function 12 non-uniform segmentation only benefits quadratic NFGs because
there is a bigger reduction in the number of required segments.


                                           # of Segments (ε = 2^-17)   # of Segments (ε = 2^-24)   # of Segments (ε = 2^-33)
 #   f(x)              Interval            LU     LN    QU    QN       LU      LN     QU     QN     LU        LN      QU      QN
 1   2^x               [0,1]               89     75    8     7        1004    849    39     35     22714     19196   311     277
10   √(-ln x)          [1/512, 1/4]        4933   209   793   33       55806   235?   3995   163    1262744   53340   31957   1302
12   (x-1)log2(1-x)    [1/256, 255/256]    1730   315   398   37       19564   3557   2006   182    442676    80480   16047   1455
     - x log2 x

Table 9  Functions with a Large Number of Segments.


Figure 55 shows how much of the NFG hardware is consumed by the SIE alone
for NFGs with non-uniform segmentation for $f(x) = 2^x$. Note that SIEs generally
contribute at least 20% of the total NFG delay for small n, and over 90% of the delay
for larger n. For a 16-bit LNB NFG, over 50% of the NFG hardware complexity is in the
SIE. The majority of a 28-bit QNB NFG is also made up of the SIE alone. Graphs for
the other functions in Table 7 display similar characteristics for NFGs with non-uniform
segmentation. These are shown in Appendix D.


Figure 55  Percent Hardware Utilization and Delay due to SIE for $f(x) = 2^x$ on the
Interval [0,1]: (a) Hardware consumed by SIE ($HUP_{SIE}/HUP_{NFG}$ versus n in bits),
(b) Delay due to SIE (versus n in bits).


We now seek a criterion to determine when it is better to use uniform
segmentation and when it is better to use non-uniform segmentation. Specifically, we
seek to establish the crossover point between these two based on hardware utilization. In
order to understand where the crossover occurs, we must examine the NFG components
closely. The components for an n-bit LUB NFG are exactly the same as those of an LNB,
except for the memory size and the n:k SIE that the LNB requires. Here we will analyze
the differences between the two architectures.


For a given function f(x), let $s^{\text{non-unif}}_{\min} = 2^{\lceil \log_2 s^{\text{non-unif}}_{\min} \rceil}$ and
$s^{\text{unif}}_{\min} = 2^{\lceil \log_2 s^{\text{unif}}_{\min} \rceil}$ be the
number of segments required for non-uniform and uniform segmentation, respectively.
They depend on the particular function and interval as well as the required precision. The
values for $s^{\text{non-unif}}_{\min}$ and $s^{\text{unif}}_{\min}$ are known for the various precisions of the 15
functions in Table 7. They can also be computed by using the author's function segments.
Define the Segment Reduction Ratio (SRR) to be

$$SRR = \frac{s^{\text{non-unif}}_{\min}}{s^{\text{unif}}_{\min}}.$$

The SRR represents the number of segments required for an NFG with non-uniform
segmentation compared to uniform. The number of memory bits required for the LUB NFG
is $M_{\text{unif}} = 2^{k_u} \times w$, where $k_u = \lceil \log_2 s^{\text{unif}}_{\min} \rceil$ and the word size stored at each
memory location is w = 2n for a LUB (or w = 3n for a QUB). Let $M_{\text{non-unif}}$ be the memory
bits required to realize the SIE and the coefficients table for the non-uniform NFG. Thus,

$$M_{\text{non-unif}} = \left\lceil \frac{n - k_n}{2} \right\rceil \cdot 2^{k_n + 2} \cdot k_n + 2^{k_n} \times w,$$

where $k_n = \lceil \log_2 s^{\text{non-unif}}_{\min} \rceil$. This assumes that the coefficients table contains a power
of 2 memory locations. A non-uniform NFG requires more hardware than a uniform NFG
when $M_{\text{non-unif}} > M_{\text{unif}}$. Now define $SRR_{crit}$ to be the value of SRR when
$M_{\text{non-unif}} = M_{\text{unif}}$, or

$$\left\lceil \frac{n - k_n}{2} \right\rceil \cdot 2^{k_n + 2} \cdot k_n + 2^{k_n} \times w = 2^{k_u} \times w.$$
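
The memory comparison between the two segmentation methods can be tried numerically
with a short Python sketch. It assumes the model stated in the text: $2^{k_u} \times w$ bits
for the uniform coefficients table, and $\lceil (n - k_n)/2 \rceil \cdot 2^{k_n+2} \cdot k_n$ bits
for the SIE plus $2^{k_n} \times w$ bits for the non-uniform table. The function names are
hypothetical, and the segment counts are illustrative values chosen to be consistent
with the SRRs quoted above for functions 1 and 10 (about 0.84 and 0.042) with n = 16 and
w = 2n.

```python
import math


def address_lines(segments: int) -> int:
    """k = ceil(log2(segments)), computed with integer arithmetic."""
    return (segments - 1).bit_length()


def memory_bits_uniform(s_unif: int, w: int) -> int:
    """Coefficients table only: 2^k_u words of w bits."""
    return 2 ** address_lines(s_unif) * w


def memory_bits_nonuniform(s_nonunif: int, n: int, w: int) -> int:
    """SIE (modeled as LUT-cascade memory) plus the 2^k_n-word coefficients table."""
    k_n = address_lines(s_nonunif)
    sie_bits = math.ceil((n - k_n) / 2) * 2 ** (k_n + 2) * k_n
    return sie_bits + 2 ** k_n * w


if __name__ == "__main__":
    n, w = 16, 32  # basic linear NFG: w = 2n
    for name, s_u, s_n in [("function 1", 89, 75), ("function 10", 4933, 209)]:
        m_u = memory_bits_uniform(s_u, w)
        m_n = memory_bits_nonuniform(s_n, n, w)
        better = "non-uniform" if m_n < m_u else "uniform"
        print(f"{name}: M_unif = {m_u}, M_non-unif = {m_n}, smaller: {better}")
```

Under these assumptions, the small segment reduction of function 1 does not pay for the
SIE, while the large reduction of function 10 does, which agrees with the trend described
for Figure 54.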


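Let $SRR = SRR_{crit}$, and substitute $s^{\text{unif}}_{\min} = \frac{s^{\text{non-unif}}_{\min}}{SRR_{crit}}$ into
$k_u = \lceil \log_2 s^{\text{unif}}_{\min} \rceil$; therefore,

$$k_u = \left\lceil \log_2 \frac{s^{\text{non-unif}}_{\min}}{SRR_{crit}} \right\rceil
      = \left\lceil \log_2 s^{\text{non-unif}}_{\min} + \log_2 \frac{1}{SRR_{crit}} \right\rceil.$$

Since we assume that $s^{\text{non-unif}}_{\min} = 2^{\lceil \log_2 s^{\text{non-unif}}_{\min} \rceil}$ ($s^{\text{non-unif}}_{\min}$ is an integer
power of 2), then

$$\left\lceil \log_2 s^{\text{non-unif}}_{\min} + \log_2 \frac{1}{SRR_{crit}} \right\rceil
  = \log_2 s^{\text{non-unif}}_{\min} + \left\lceil \log_2 \frac{1}{SRR_{crit}} \right\rceil.$$

Therefore,

$$\left\lceil \frac{n - k_n}{2} \right\rceil \cdot 2^{k_n + 2} \cdot k_n + 2^{k_n} \times w
  = 2^{\log_2 s^{\text{non-unif}}_{\min} + \left\lceil \log_2 \frac{1}{SRR_{crit}} \right\rceil} \times w
  = 2^{k_n} \cdot 2^{\left\lceil \log_2 \frac{1}{SRR_{crit}} \right\rceil} \times w.$$

Dividing both sides of the equation by $2^{k_n}$ yields

$$4 \left\lceil \frac{n - k_n}{2} \right\rceil k_n + w = 2^{\left\lceil \log_2 \frac{1}{SRR_{crit}} \right\rceil} \times w.$$

Knowing that $2^{\left\lceil \log_2 \frac{1}{SRR_{crit}} \right\rceil} \ge \frac{1}{SRR_{crit}}$,

$$4 \left\lceil \frac{n - k_n}{2} \right\rceil k_n + w \ge \frac{1}{SRR_{crit}} \times w.$$

Solving for $SRR_{crit}$ yields

$$SRR_{crit} \ge \frac{w}{4 \left\lceil \frac{n - k_n}{2} \right\rceil k_n + w}.$$
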

This equation is plotted in Figure 56 for basic linear and basic quadratic NFGs. Now we
seek to find the minimum value of $SRR_{crit}$. First consider the case where n is even. Since
$k_n, n \in \mathbb{Z}$ and $k_n$ is even, $n - k_n$ is even. Thus, we can remove the ceiling function:

$$SRR_{crit} \ge \frac{w}{(2n - 2k_n) k_n + w} = \frac{1}{(2n - 2k_n)\frac{k_n}{w} + 1}.$$

Therefore,

$$SRR_{crit}\big|_{n=even} \ge \frac{1}{2n\frac{k_n}{w} - 2\frac{k_n^2}{w} + 1}.$$

For basic linear NFGs, w = 2n. For basic quadratic NFGs, w = 3n. Thus,

$$SRR^{\text{Basic Linear}}_{crit}\big|_{n=even} \ge \frac{1}{k_n - \frac{k_n^2}{n} + 1}
  \quad \text{and} \quad
  SRR^{\text{Basic Quadratic}}_{crit}\big|_{n=even} \ge \frac{1}{\frac{2}{3}k_n - \frac{2}{3}\frac{k_n^2}{n} + 1}.$$



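For cases when n is odd, $n - k_n$ is odd. Thus,

$$\left\lceil \frac{n - k_n}{2} \right\rceil = \left\lceil \frac{n - k_n - 1}{2} \right\rceil + 1 = \frac{n - k_n - 1}{2} + 1.$$

Therefore,

$$SRR_{crit}\big|_{n=odd} \ge \frac{w}{4\left(\frac{n - k_n - 1}{2} + 1\right) k_n + w}
  = \frac{w}{(2n - 2k_n + 2) \cdot k_n + w}.$$
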

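This reduces to:

$$SRR_{crit}\big|_{n=odd} \ge \frac{1}{2n\frac{k_n}{w} - 2\frac{k_n^2}{w} - 2\frac{k_n}{w} + 1}.$$

For linear and quadratic cases, w = 2n and w = 3n. Thus,

$$SRR^{\text{Basic Linear}}_{crit}\big|_{n=odd} \ge \frac{1}{k_n - \frac{k_n^2}{n} - \frac{k_n}{n} + 1}$$

and

$$SRR^{\text{Basic Quadratic}}_{crit}\big|_{n=odd}
  \ge \frac{1}{\frac{2}{3}k_n - \frac{2}{3}\frac{k_n^2}{n} - \frac{2}{3}\frac{k_n}{n} + 1}
  = \frac{3/2}{k_n - \frac{k_n^2}{n} - \frac{k_n}{n} + \frac{3}{2}}.$$

Since $\frac{k_n}{n} > 0$,

$$SRR^{\text{Basic Linear}}_{crit}\big|_{n=odd} > SRR^{\text{Basic Linear}}_{crit}\big|_{n=even}
  \quad \text{and} \quad
  SRR^{\text{Basic Quadratic}}_{crit}\big|_{n=odd} > SRR^{\text{Basic Quadratic}}_{crit}\big|_{n=even}.$$
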

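Thus, the minimum critical SRR, $SRR_{crit,\min}$, can be found by finding the minimum of
$SRR_{crit}\big|_{n=even}$ or the maximum of $\frac{1}{SRR_{crit}|_{n=even}}$. Differentiating the latter is much
simpler and provides the same information. Thus,

$$\frac{\partial}{\partial k_n}\left(\frac{1}{SRR^{\text{Basic Linear}}_{crit}}\right)
  = \frac{\partial}{\partial k_n}\left(k_n - \frac{k_n^2}{n} + 1\right) = 0.$$

Solving for $k_n$ yields

$$1 - \frac{2k_n}{n} = 0 \;\Rightarrow\; k_n = \frac{n}{2}.$$

This means that the maximum of $\frac{1}{SRR_{crit}}$ occurs when $k_n = \frac{n}{2}$; therefore, the
minimum of $SRR_{crit}$ occurs when $k_n = \frac{n}{2}$. Applying the same process to the
quadratic case yields the same results. Substituting $k_n = \frac{n}{2}$ to find $SRR_{crit,\min}$ yields

$$SRR^{\text{Basic Linear}}_{crit,\min} = \frac{1}{\frac{n}{2} - \frac{(n/2)^2}{n} + 1} = \frac{1}{\frac{n}{4} + 1} = \frac{4}{n + 4}$$

and

$$SRR^{\text{Basic Quadratic}}_{crit,\min} = \frac{3/2}{\frac{n}{2} - \frac{(n/2)^2}{n} + \frac{3}{2}}
  = \frac{3/2}{\frac{n}{4} + \frac{3}{2}} = \frac{6}{n + 6}.$$
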

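$SRR_{crit,\min}$ is the minimum SRR below which non-uniform segmentation requires
less hardware than uniform segmentation, regardless of $k_n$ or $k_u$. Thus, $SRR_{crit,\min}$ is also
independent of the number of segments, $s^{\text{non-unif}}_{\min}$ and $s^{\text{unif}}_{\min}$. It is shown that $SRR_{crit,\min}$ is
only a function of n when $s^{\text{non-unif}}_{\min} = 2^{\lceil \log_2 s^{\text{non-unif}}_{\min} \rceil}$ and
$s^{\text{unif}}_{\min} = 2^{\lceil \log_2 s^{\text{unif}}_{\min} \rceil}$. Recall the definition,
$SRR = \frac{s^{\text{non-unif}}_{\min}}{s^{\text{unif}}_{\min}}$. Also recall that for linear NFGs,

$$s^{\text{unif}}_{\min} = \left\lceil \frac{b - a}{4\sqrt{\varepsilon}} \sqrt{\max_{a \le x \le b} \left| f^{(2)}(x) \right|} \right\rceil
  \quad \text{and} \quad
  s^{\text{non-unif}}_{\min} \approx \frac{1}{4\sqrt{\varepsilon}} \int_a^b \sqrt{\left| f^{(2)}(x) \right|}\, dx.$$

Therefore,

$$SRR^{\text{Linear}} \approx \frac{\int_a^b \sqrt{\left| f^{(2)}(x) \right|}\, dx}{(b - a)\sqrt{\max_{a \le x \le b} \left| f^{(2)}(x) \right|}}$$
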
for small ε. For the analyses in this thesis, ε is sufficiently small. For all of the
functions in Table 7, the maximum difference in SRR is 0.022. The largest ε in Table 7
is $2^{-17}$. Since practical NFGs generally require $\varepsilon \le 2^{-17}$, calculating the SRR for a
function using the asymptotic equations above is relatively accurate.
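
The asymptotic expressions can be checked numerically with a short Python sketch. It
assumes the linear segment-count formulas take the form given above (uniform count
proportional to the maximum of $\sqrt{|f^{(2)}|}$, non-uniform count proportional to its
integral) and evaluates them for $f(x) = 2^x$ on [0,1]. The function name and the use of
Simpson's rule are choices made for this illustration, not the author's segments.m.

```python
import math


def linear_segment_estimates(f2, a, b, eps, steps=2000):
    """Return (s_unif, s_nonunif, srr) for a linear NFG.

    f2    : second derivative of the function being realized
    eps   : allowed approximation error
    steps : number of Simpson intervals (must be even)
    """
    h = (b - a) / steps
    root = [math.sqrt(abs(f2(a + i * h))) for i in range(steps + 1)]
    # Composite Simpson's rule for the integral of sqrt(|f''(x)|) over [a, b].
    integral = (h / 3) * (root[0] + root[-1]
                          + 4 * sum(root[1:-1:2]) + 2 * sum(root[2:-1:2]))
    peak = max(root)  # sampled maximum of sqrt(|f''(x)|)
    scale = 1 / (4 * math.sqrt(eps))
    s_unif = math.ceil(scale * (b - a) * peak)
    s_nonunif = math.ceil(scale * integral)
    return s_unif, s_nonunif, integral / ((b - a) * peak)


if __name__ == "__main__":
    # f(x) = 2^x, so f''(x) = (ln 2)^2 * 2^x.
    second_derivative = lambda x: math.log(2) ** 2 * 2 ** x
    s_u, s_n, srr = linear_segment_estimates(second_derivative, 0.0, 1.0, 2 ** -24)
    print(s_u, s_n, round(srr, 3))
```

For $\varepsilon = 2^{-24}$ this gives 1004 uniform and 849 non-uniform segments and an SRR of
about 0.845, in line with the linear values reported for function 1. The maximum is taken
over sample points, so a function with a sharp interior peak would need a finer grid.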


Clearly, the SRR of a particular NFG depends only on the function being realized
and its domain [a,b], and not on ε. Therefore, SRR does not depend on n. This is also
confirmed by comparing the SRRs in Table 10, which are calculated from the numbers of
segments in Table 7. The significance of this conclusion is that if the number of
segments for a particular function is known for both uniform and non-uniform
segmentations, then $SRR_{crit}$ can be found as a function of n and $s^{\text{non-unif}}_{\min}$. Since the SRR
of a particular function does not depend on n, the relation between $SRR_{crit}$ and SRR
determines at what values of n non-uniform segmentation is beneficial.


Once n, f(x), and [a,b] are known, it can be determined easily whether non-uniform
segmentation is always beneficial, independent of the number of segments required. If

$$SRR^{\text{Basic Linear}}_{crit,\min} > \frac{\int_a^b \sqrt{\left| f^{(2)}(x) \right|}\, dx}{(b - a)\sqrt{\max_{a \le x \le b} \left| f^{(2)}(x) \right|}} = SRR^{\text{Basic Linear}},$$

then a linear NFG using non-uniform segmentation requires less hardware than the same
NFG using uniform segmentation. These calculations are based on using SIEs comprised of
LUT cascades and using Chebyshev polynomials to compute the coefficients for each
segment [5].

The results are shown in Figure 56. SRRs for functions 10 and 12 have also been
plotted in Figure 56. There are three points for each function corresponding to the
calculated values for each precision in Table 10. Notice that for function 10
($f(x) = \sqrt{-\ln x}$), $SRR_{10} \approx 0.04$ for both linear and quadratic NFGs. These are below
any of the $SRR_{crit}$ curves shown for both the linear and quadratic NFGs for n < 64.
Correspondingly, the HUP plots in Figure 54 show that $HUP_{LNB} < HUP_{LUB}$ and
$HUP_{QNB} < HUP_{QUB}$. For function 12, $SRR_{12} \approx 0.18$ for linear NFGs, lying above the
$SRR_{crit}$ curve for $n \ge 24$. This means that uniform segmentation consumes less total
hardware for a 24-bit NFG realizing $f(x) = (x-1)\log_2(1-x) - x\log_2 x$ than non-uniform
segmentation. This is also shown in Figure 54b.
                                                    SRR (ε = 2^-17)      SRR (ε = 2^-24)      SRR (ε = 2^-33)
 #   f(x)                       Interval            Linear  Quadratic    Linear  Quadratic    Linear  Quadratic
 1   2^x                        [0,1]               0.842   0.875        0.845   0.897        0.845   0.891
 2   1/x                        [1,2]               0.6     0.625        0.586   0.617        0.586   0.619
 3   √x                         [1,2]               0.760   0.714        0.758   0.75         0.757   0.738
 4   1/√x                       [1,2]               0.63    0.727        0.637   0.655        0.636   0.655
 5   log2(x)                    [1,2]               0.697   0.692        0.693   0.688        0.693   0.694
 6   ln(x)                      [1,2]               0.692   0.667        0.693   0.696        0.693   0.694
 7   sin(πx)                    [0,1/2]             0.762   0.857        0.693   0.829        0.763   0.824
 8   cos(πx)                    [0,1/2]             0.762   0.857        0.763   0.829        0.763   0.824
 9   tan(πx)                    [0,1/4]             0.510   0.667        0.511   0.659        0.510   0.651
10   √(-ln x)                   [1/512, 1/4]        0.042   0.042        0.042   0.041        0.042   0.041
11   tan^2(πx) + 1              [0,1/4]             0.537   0.533        0.535   0.523        0.535   0.523
12   (x-1)log2(1-x)             [1/256, 255/256]    0.182   0.093        0.182   0.091        0.182   0.091
     - x log2 x
13   1/(1 + e^-x)               [0,1]               0.714   0.800        0.731   0.870        0.730   0.888
14   (1/√(2π)) e^(-x^2/2)       [0,√2]              0.654   0.818        0.650   0.865        0.650   0.864
15   sin(e^x)                   [0,2]               0.369   0.424        0.369   0.417        0.369   0.423

Table 10  Table of SRR for the Suite of Functions.


For the majority of the functions in Table 10, SRR > 0.5. This is well above all
of the curves in Figure 56. This means that non-uniform segmentation results in higher
hardware utilization for 13 of the 15 functions.

In summary, it is only beneficial to implement non-uniform instead of uniform
segmentation when it can be shown that there is a large savings in the number of required
segments (small SRR). The minimal amount of savings, $SRR_{crit}$, is related to the number
of segments and the size of the NFG being implemented, n. If the coefficient tables
contain a power of 2 memory locations (which is often the case in hardware), this
minimum amount of savings can be quantified. The actual amount of savings, $SRR_{f(x)}$, is
shown to depend only on f(x) and the domain [a,b] of the NFG realizing it. Data plots
in Appendix D.1 show which particular NFG realizations require less hardware for
particular functions.


The derivations of $SRR_{crit,min}$ have been shown above for the basic architectures
described in Chapter IV, but they can also be applied to other architectures. We can
generalize the process by allowing w to remain in the equations for $SRR_{crit}$.

Since the lower bound on $SRR_{crit}$ for odd n exceeds the general bound given next,
$SRR_{crit,min} = \min(SRR_{crit})$ follows from the general equation.

Now we find the minimum of the general equation:

$$SRR_{crit} \ge \frac{1}{2n\frac{k_n}{w} - 2\frac{k_n^2}{w} + 1} = \frac{\frac{w}{2n}}{k_n - \frac{k_n^2}{n} + \frac{w}{2n}}$$

Like the linear and quadratic cases, the minimum occurs when $k_n = \frac{n}{2}$. Thus,

$$SRR_{crit,min} = \frac{\frac{w}{2n}}{k_n - \frac{k_n^2}{n} + \frac{w}{2n}}\Bigg|_{k_n = n/2} = \frac{\frac{w}{2n}}{\frac{n}{4} + \frac{w}{2n}} = \frac{\frac{2w}{n}}{n + \frac{2w}{n}}$$


This determination stems from a comparison between $M_{unif}$ and $M_{non\text{-}unif}$, and
assumes that the remaining arithmetic components in the two NFGs are exactly the same.
For example, consider the compact NFG architectures described in Chapter IV. The
compact linear NFG assumes $w = \lceil 1.5n \rceil$ and the compact quadratic NFG
assumes $w = 3n$. To compare other architectures, simply replace w with the number of
bits stored at each location in memory.
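
The generalized bound can be checked numerically. The following is a minimal Python sketch, not part of the thesis software; the function names `srr_crit` and `srr_crit_min` are hypothetical, and the formulas are the general expression in terms of n, k_n, and w and its closed-form minimum at k_n = n/2 as stated in this section.

```python
def srr_crit(n, k_n, w):
    """General bound (w/2n) / (k_n - k_n^2/n + w/2n) for an n-bit NFG whose
    non-uniform table has 2**k_n locations of w bits each."""
    half = w / (2 * n)
    return half / (k_n - k_n ** 2 / n + half)


def srr_crit_min(n, w):
    """Closed-form minimum of srr_crit over k_n, reached at k_n = n/2."""
    return (2 * w / n) / (n + 2 * w / n)


# Example: a 16-bit NFG storing w = 3n bits per location.
n = 16
w = 3 * n
smallest = min(srr_crit(n, k, w) for k in range(1, n))
print(smallest, srr_crit_min(n, w))
```

For even n the scan over integer k_n reproduces the closed form; for odd n the value k_n = n/2 is not an integer, so the closed form is only a lower bound on the scanned values.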
2. Comparing Delays


The delay graphs for basic and compact NFGs in Appendix D.1 show that, for all
of the functions in the function suite, the delay is larger for NFGs with non-uniform
segmentation. Figure 55b shows that at least 20% of the delay of a non-uniform NFG is
due to the SIE alone. The percent delay that is attributed to the SIE is shown for 15
functions in Appendix D.4.

Again, the main difference between uniform and non-uniform NFGs is the SIE in
the latter. The remaining hardware is the same, and contributes the same delay to the
total delay. This section compares the delay for a coefficients table for an NFG with
uniform segmentation, $t^{unif} = t^{unif}_{ROM}$, to the sum of the delays for the coefficient
table and SIE for an NFG with non-uniform segmentation,
$t^{non\text{-}unif} = t^{non\text{-}unif}_{ROM} + t_{SIE}$. For $s \le 2^{14}$, or
$k \le 14$, a single BRAM can be used as the coefficients table. Thus, $t_{ROM} = t_{BRAM}$.
Therefore, if both $k_n \le 14$ and $k_u \le 14$, then
$t^{unif}_{ROM} = t^{non\text{-}unif}_{ROM} = t_{BRAM}$ and $t^{non\text{-}unif} > t^{unif}$ for all
n because of the SIE. When $k_n > 14$ and $k_u > 14$,
$t^{unif} = t_{BRAM} + t_{(k_u - 14):1\,MUX}$ and
$t^{non\text{-}unif} = t_{BRAM} + t_{(k_n - 14):1\,MUX} + t_{n:k_n\,SIE}$. If $k > 21$, then all of
the BRAM on the Xilinx Virtex-II would be consumed. Thus the maximum required MUX size
is a 7:1 MUX. Figure 57 shows that a 7:1 MUX has a delay of
$t_{MUX,max} = t_{7:1\,MUX} \approx 4.6$ ns. To find the minimum $t_{SIE}$ when $k_n > 14$, we
look at the delay for a 16:15 SIE because n must be greater than $k_n$. Therefore,
$t_{n:k_n\,SIE,min} > 21.6$ ns. Since $t_{SIE,min} > t_{MUX,max}$ when $k_n > 14$ and
$k_u > 14$, it follows that $t^{non\text{-}unif} > t^{unif}$ for all n.
Figure 57  (a) MUX Delay.
Since $k_u > k_n$ for all non-uniform NFGs, there is only one remaining case to
consider: when $k_n \le 14$ and $k_u > 14$. Here $t^{unif} = t_{BRAM} + t_{(k_u - 14):1\,MUX}$ and
$t^{non\text{-}unif} = t_{BRAM} + t_{n:k_n\,SIE}$, so $t^{unif} > t^{non\text{-}unif}$ if and only if
$t_{(k_u - 14):1\,MUX} > t_{n:k_n\,SIE}$. Again, the maximum delay for the MUX is
$t_{7:1\,MUX} \approx 4.6$ ns. Figure 58 shows when $t_{n:k_n\,SIE} < 4.6$ ns; the x-axis is
$k_n$ and the y-axis is $n - k_n$.


Figure 58 Delay for SIE < 4.6 ns.

The maximum size SIE where the delay is less than that of the maximum MUX is
an 8:6 SIE. This means n can be at most 8 bits. Therefore, when n > 8, an NFG with
uniform segmentation is always faster than one with non-uniform segmentation. In
addition, in order for a non-uniform NFG to be faster than a uniform NFG, it would
require that $k_n \le 6$ and $k_u = 21$. This means that $s^{non\text{-}unif} \le 2^6 = 64$
segments and $s^{unif} \approx 2^{21} = 2{,}097{,}152$ segments. Correspondingly,
$SRR \le 2^{-15} \approx 0.00003$. In summary, there are not likely any practical cases where
$t^{unif} > t^{non\text{-}unif}$. The plots in Appendix D.1 confirm this.
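
The case analysis above can be sketched in a few lines of Python. This is an illustration only, not the thesis software: the function name `table_path_delays` and the callables `t_mux` and `t_sie` are hypothetical, and the BRAM delay value is a placeholder. Only the 4.6 ns MUX bound and the 21.6 ns figure for a 16:15 SIE are taken from the text.

```python
def table_path_delays(k_u, k_n, n, t_bram, t_mux, t_sie, single_bram_bits=14):
    """Return (t_unif, t_nonunif) for the coefficient-table path.

    k_u, k_n : address bits of the uniform / non-uniform coefficient table
    t_mux(m) : delay of an m:1 MUX (hypothetical callable)
    t_sie(n, k_n) : delay of an n:k_n SIE (hypothetical callable)
    """
    def rom_delay(k):
        # A single BRAM serves up to 2**14 locations; larger tables add a
        # (k - 14):1 MUX behind the BRAMs.
        if k <= single_bram_bits:
            return t_bram
        return t_bram + t_mux(k - single_bram_bits)

    t_unif = rom_delay(k_u)
    t_nonunif = rom_delay(k_n) + t_sie(n, k_n)  # the SIE is always in the path
    return t_unif, t_nonunif


# Placeholder delay models in ns. T_BRAM is made up for illustration.
T_BRAM = 3.0
worst_mux = lambda inputs: 4.6       # upper bound: 7:1 MUX
sie_16_15 = lambda n, k_n: 21.6      # bound quoted for a 16:15 SIE

t_unif, t_nonunif = table_path_delays(21, 15, 16, T_BRAM, worst_mux, sie_16_15)
print(t_unif, t_nonunif)
```

With any SIE delay larger than the worst-case MUX delay, the non-uniform path is slower regardless of the BRAM delay chosen, which is the argument made in this section.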

B.  COMPARING  LINEAR  VERSUS  QUADRATIC 

When considering whether to use quadratic or linear NFGs, there are tradeoffs to
consider. The tradeoff is between arithmetic component hardware size and
coefficient table size. The size of the coefficients table depends on the function and
interval. For a given function, the number of segments is less for a quadratic NFG than
for a linear NFG. But the basic quadratic NFG requires three coefficients for each
segment while the basic linear NFG requires only two. Thus, for the same number of
segments, the quadratic coefficient table is 150% the size of the linear one. In
addition, quadratic NFGs require additional multipliers and adders, which grow in
complexity as n grows. The crossover occurs when n becomes large enough that the
coefficients table is a larger percentage of the overall NFG complexity than the
rest of the arithmetic components. An example of when the crossover occurs is shown in
Figure 52 for both HUP and delay. For the function f(x) = 2^x on the interval [0,1],
when n < 40, $t_{LUB} < t_{QUB}$, and when n < 21, $HUP_{LUB} < HUP_{QUB}$. This is only one
example, but the graphs in Appendix D.1 show where the crossovers occur for the
remaining 14 functions in the function suite.
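
The table-size side of this tradeoff can be illustrated with a small Python sketch. It assumes, as a simplification not stated in this section, that every coefficient is stored with n bits; the names `table_bits` and `quadratic_table_is_smaller` are hypothetical, and the arithmetic components are deliberately left out.

```python
def table_bits(segments, coeffs_per_segment, n):
    """Coefficient-table size in bits, assuming n bits per stored coefficient."""
    return segments * coeffs_per_segment * n


def quadratic_table_is_smaller(s_lin, s_quad, n):
    """True when the 3-coefficient quadratic table needs fewer bits than the
    2-coefficient linear table, i.e. when s_quad / s_lin < 2/3."""
    return table_bits(s_quad, 3, n) < table_bits(s_lin, 2, n)


# Illustrative segment counts only; real counts depend on f(x), [a,b], and n.
print(quadratic_table_is_smaller(1024, 64, 16))
print(quadratic_table_is_smaller(96, 64, 16))
```

Under this assumption the quadratic table wins only when it needs fewer than two thirds of the linear segment count; whether the whole NFG is smaller still depends on the extra multipliers and adders.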

The HU-Delay graphs in Figure 59 and Figure 60 compare 16-bit NFGs realizing
f(x) = 2^x on [0,1]. In Figure 59, the total HUP and delay are less for the LUB than for
the QUB. Clearly, the linear NFG is better. Now compare the non-uniform NFGs in
Figure 60. Since the SIE makes the linear NFG much bigger and slower, the delay of the
LNB is longer than that of the QNB. However, the QNB requires more hardware than
the LNB.
LUB: Total HUP = 0.3742%, Total Propagation Delay = 18.07 ns.
QUB: Total HUP = 1.493%, Total Propagation Delay = 32.38 ns.

Figure 59 HU-Delay Graph Comparing LUB and QUB.

LNB: Total HUP = 0.9646%, Total Propagation Delay = 55.31 ns.
QNB: Total HUP = 1.514%, Total Propagation Delay = 39.64 ns.

Figure 60 HU-Delay Graph Comparing LNB and QNB.

In general, for large n, it is better to implement quadratic NFGs for a given type of
segmentation. When the reduction in coefficient table size gained by moving from linear
to quadratic NFGs outweighs the accompanying increase in arithmetic component
complexity, quadratic NFGs become less complex than their linear counterparts. Since
memory and SIE sizes depend on the particular function, generalizing a criterion for
deciding whether a linear or quadratic NFG requires more hardware, or has a longer
delay, is extremely difficult. For this reason, we apply the data collected from
estimations using the models in Chapter IV. The crossover points for delay and hardware
utilization can be found in the graphs in Appendix D.1. The crossover points for delay
and HUP often occur at separate values of n. This means that if it is desired to minimize
hardware usage instead of the delay, then the HUP crossover must be considered.
C. CHAPTER SUMMARY


This  chapter  shows  how  the  estimation  tools  developed  in  Chapter  IV  are  used  to 
analyze  characteristics  of  common  NFG  architectures.  It  analyzes  eight  NFG  models  for 
15  functions,  providing  graphical  data  that  shows  which  architecture  consumes  the  least 
hardware  or  has  the  smallest  delay  for  each  function.  This  data  shows  that  quadratic 
NFGs  require  less  hardware  and  have  shorter  delays  as  the  size  of  the  NFG  gets  larger.  It 
also  establishes  a  criterion  for  when  non-uniform  segmentation  is  beneficial  for  a 
particular  function,  based  on  the  size  of  the  NFG.  The  findings  in  this  chapter  show  that 
NFGs  with  non-uniform  segmentation  generally  require  more  hardware  and  almost 
always  have  longer  delays  than  NFGs  with  uniform  segmentation.  Chapter  VI 
summarizes  the  findings  in  this  chapter  and  the  development  of  the  models  in  this  thesis. 
VI. CONCLUSIONS AND RECOMMENDATIONS


This  thesis  develops  a  software  model  for  estimating  complexity  and  delay  for 
NFGs.  It  also  uses  the  software  to  analyze  characteristics  of  common  NFGs. 

A.  SOFTWARE  MODEL 

This  thesis  shows  how  complexities  and  delays  for  NFGs  can  be  estimated 
without  having  to  build  them.  The  software  framework  developed  in  this  thesis  provides 
a  fast  method  for  comparing  NFGs  over  a  wide  range  of  functions,  architectures,  and 
sizes. 


1.  Comparing  Common  NFG  Component  Complexity  and  Delay 

The software can be used to find hardware utilization and delay for several
components. The implementations of common NFG components in specific FPGA
hardware are analyzed in depth to estimate their complexity and delay based on the
number of inputs, n (up to n = 128). Specific simulation data from behavioral models and
schematic circuits is used in determining the complexity and delay of each component.
Missing data is interpolated with linear approximations. The software provides a quick
and simple way to determine hardware utilization and delay for a particular component.
This allows various components to be compared to determine which best suits a
particular application.
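
The interpolation of missing data points can be sketched as follows. This is a minimal Python illustration of linear fill between measured points, under the assumption that the measured word widths are positive integers; the name `fill_linear` is hypothetical and the code is not a translation of the author's MATLAB routine.

```python
def fill_linear(data_x, data_y):
    """Return (filled_x, filled_y) covering every integer from 1 to max(data_x).

    Measured points are kept as-is; missing integers are filled by straight-line
    interpolation between the neighbouring measured points. Integers below the
    first measured x take the first measured y.
    """
    points = sorted(zip((int(round(x)) for x in data_x), data_y))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    filled_x = list(range(1, xs[-1] + 1))
    filled_y = []
    j = 0  # index of the measured point at or to the left of x
    for x in filled_x:
        if x <= xs[0]:
            filled_y.append(ys[0])
            continue
        while xs[j + 1] < x:
            j += 1
        x0, x1 = xs[j], xs[j + 1]
        y0, y1 = ys[j], ys[j + 1]
        filled_y.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return filled_x, filled_y


# Illustrative data: measurements exist only at widths 1, 2, 5, 9 and 20.
fx, fy = fill_linear([1, 2, 5, 9, 20], [3, 4, 6, 8, 10])
print(fy[2])  # estimated value at width 3
```

Linear fill keeps every measured value exact, so the estimate is only as good as the spacing of the simulated points around the width of interest.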

2.  Modeling  and  Comparing  NFGs 

This software provides a simple means to combine several components in
series/parallel configurations to represent an NFG or other arithmetic logic device. The
software determines the worst case propagation delay through the NFG as well as the
total hardware used by the NFG. It can be used to compare various NFG architectures
for various sizes. The HU-Delay graphs can be used to visually compare NFGs, as well
as visually compare the relative sizes and delays of the components inside them.
B. RESULTS OF NFG ANALYSES

The results provide an easy way to choose the best architecture based on hardware
complexity and/or delay. This thesis also shows that the complexity and delay of an NFG
greatly depend on the complexity and delay of its coefficient table and associated SIE
(for non-uniform NFGs).

1.  Benefits  of  Non-uniform  Segmentation 

For 13 of the 15 functions analyzed in this thesis, non-uniform segmentation
offers no benefits. However, when non-uniform segmentation drastically reduces the
number of segments in an NFG, it can reduce the overall hardware utilization. The delay
is almost always longer for NFGs with non-uniform segmentation.

a.  A  Criterion  when  Non-Uniform  Segmentation  Requires  Less 
Hardware 


The majority of the functions in Table 10 show that non-uniform
segmentation still requires at least 50% of the segments required by uniform
segmentation. Only two of the fifteen functions show a reduction to less than 10%. This
thesis shows a criterion that can be used to determine which segmentation method
requires less hardware for basic NFGs. It compares the reduction in the number of
segments by non-uniform segmentation (SRR) to the NFG size, n. The minimum amount
of reduction required, $SRR_{crit,min}$, depends on the number of segments (which depends on
s(n)) and the properties and domain of the function being realized. This thesis also
shows that the SRR of a given function depends only on the properties of that function
and the domain of the NFG implementing the function. When the number of segments
(corresponding to the number of memory locations) is restricted to a power of two,
$SRR_{crit,min}$ becomes a function of n only. For a basic linear NFG, if

$$SRR_{f(x)} < \frac{4}{n + 4},$$

where $SRR_{f(x)}$ is determined by $f''(x)$ on $[a,b]$, non-uniform segmentation requires
less hardware. This is true for basic quadratic NFGs when

$$SRR_{f(x)} < \frac{6}{n + 6}.$$

From these equations, a critical value of n, $n_{crit}$, can be determined, below which it
is always more hardware efficient to use non-uniform segmentation.

The derivations of these equations assume that LUT cascades are used in the SIE for the
non-uniform NFGs and Chebyshev polynomials are used to determine the coefficients for
the approximation equations. They also assume the basic architectures described in
Chapter IV are used.
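
The critical size can be obtained by solving the criterion for n. The Python sketch below assumes the criterion takes the form SRR < 2c/(n + 2c), where c is the number of n-bit coefficients stored per segment (c = 2 for a basic linear NFG, c = 3 for a basic quadratic NFG); the function name `n_crit` is hypothetical. The SRR values in the example are the linear entries for functions 10 and 1 in Table 10.

```python
def n_crit(srr, coeffs_per_segment):
    """Solve srr < 2c / (n + 2c) for n: non-uniform segmentation is the more
    hardware-efficient choice for n below the returned value."""
    two_c = 2 * coeffs_per_segment
    return two_c * (1 - srr) / srr


# Function 10, sqrt(-ln x): SRR = 0.042 for the linear case.
print(n_crit(0.042, 2))
# Function 1, 2^x: SRR = 0.842 for the linear case.
print(n_crit(0.842, 2))
```

A critical value below one bit, as for the second example, means that under this criterion non-uniform segmentation never reduces hardware for that function at any practical size.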


b.  Delays  for  Non-Uniform  Segmentation 

This thesis shows that non-uniform segmentation always has a longer
delay than uniform segmentation, except in rare trivial NFGs (where n ≤ 8). In fact,
when NFG architectures for 15 functions were compared in terms of delay, non-uniform
NFGs proved the best only in a few cases when n < 2. If n < 2, two LUTs can be used
instead of an NFG. Therefore, for all practical NFGs, propagation delay is longer when
non-uniform segmentation is implemented. Appendices D.2.2 and D.3.2 show the best
architectures based on delay.


2.  Linear  vs.  Quadratic  NFGs 

When considering linear versus quadratic NFGs for the 15 functions in the suite,
LUB NFGs consume less hardware than QUB NFGs for n less than approximately 25 to 29
bits. They also have smaller delays than QUB NFGs for n less than approximately 37 to 39
bits. Appendix D.2 shows which of the four basic architectures is best in terms of HUP
for all 15 of the functions in Table 7. It also shows which is better in terms of delay.
The crossover points for compact architectures vary from the basic architectures.
Appendix D.3 shows the best of the compact architectures.
C. RECOMMENDATIONS FOR FUTURE WORK


The  method  of  estimating  component  complexity  and  delay  in  this  thesis  allows 
meaningful  comparisons  to  be  made.  The  software  developed  in  this  thesis  is  meant  to  be 
used  in  future  applications  with  minor  alterations. 

1.  Using  Other  FPGAs 

It  may  be  beneficial  to  estimate  hardware  utilization  and  delay  for  the  models 
developed  in  this  thesis  on  other  FPGAs.  The  author’s  MATLAB  file 
LoadXilinxDeviceData  contains  specifications  for  the  Xilinx  Virtex-II  XC2V6000 
FPGA  with  a  speed  grade  of  -4.  The  timing  and  hardware  parameters  can  be  specified 
for  other  Virtex-II  FPGAs  as  well.  HUandDelay  assumes  that  arithmetic  components 
are  constructed  as  described  in  Chapter  III.  The  method  of  component  construction  is 
common  to  all  Virtex-II  FPGAs.  Thus,  by  changing  the  parameters  in 
LoadXilinxDeviceData,  complexity  and  delay  estimations  can  be  made  easily  for  the 
entire  family  of  FPGAs.  To  estimate  FPGAs  other  than  Virtex-II,  minor  alterations  to 
HUandDelay  are  required  to  allow  for  variations  in  component  construction.  For 
example,  the  Virtex-II  resources  include  18-bit  signed  multipliers.  Other  FPGAs  may  not 
contain  multipliers  at  all.  Therefore,  the  multiplier  estimation  section  has  to  be  re-written 
to  provide  estimations  based  on  how  the  specific  FPGA  implements  multipliers. 

2.  Creating  and  Comparing  Other  Models 

Each  of  the  eight  models  in  this  thesis  has  been  constructed  in  a  standard  manner. 
They  can  be  used  as  templates  to  build  other  models. 

a.  Analyzing  Other  Methods  for  Reducing  NFG  Hardware  and 
Delay 

Modern research concerning NFGs often focuses on reducing hardware
and/or delay. Research in [5] shows a reduction in the number of segments by
implementing non-uniform segmentation, resulting in dramatic reduction in the amount
of memory required for the NFG. Other research shows that a reduction in arithmetic
component size can be achieved by other means [4]. For example, using linear NFGs
that have a slope that is a power of two reduces complex multipliers into simpler barrel
shifters. Models can easily be built to compare the tradeoffs between the several
methods.
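
The effect of a power-of-two slope can be shown with integer arithmetic. The Python sketch below is only an illustration of the shift-for-multiply substitution; the function name `linear_segment_shift` and its fixed-point convention are hypothetical and are not taken from [4].

```python
def linear_segment_shift(x, c0, shift):
    """Evaluate c0 + x * 2**shift on integers using a shift and an add.

    A non-negative shift multiplies by a power of two; a negative shift divides
    by one (truncating toward negative infinity, as a hardware arithmetic right
    shift would).
    """
    scaled = x << shift if shift >= 0 else x >> -shift
    return c0 + scaled


# Slope 4 (shift = 2) and slope 1/2 (shift = -1) need no multiplier.
print(linear_segment_shift(5, 3, 2))
print(linear_segment_shift(6, 0, -1))
```

In hardware terms the multiplier in the linear datapath is replaced by a barrel shifter whose shift amount is stored per segment in place of the slope coefficient.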


b.  Comparing  NFGs  with  Specifically  Sized  Components 

Architectures  in  [13]  are  shown  to  reduce  arithmetic  component 
complexity  and  actually  specify  component  bit-widths.  Models  for  these  architectures 
can  be  constructed  and  compared  to  the  basic  models  in  this  thesis  to  illustrate  relative 
hardware  and  delay  savings.  The  size  of  each  component  in  the  NFG  can  be  specified  in 
the  model  file  (i.e.  model_*.m),  allowing  the  models  to  be  extremely  flexible. 

3.  Categorizing  Functions  that  Benefit  from  Non-Uniform  Segmentation 

This thesis shows that non-uniform segmentation is only beneficial when $SRR_{f(x)}$
is small. The values of $SRR_{f(x)}$ depend only on the function and the domain of the NFG
realizing it. For linear NFGs, it is related to $f^{(2)}(x)$, and for quadratic NFGs, it is
related to $f^{(3)}(x)$. Specific functions can be found where $SRR_{f(x)}$ is small. Thus,
they are likely candidates to employ non-uniform segmentation.

4.  Analyzing  Domain/Range  Reduction  Methods  for  Reducing  NFG 
Hardware  and  Delay 

Aside from looking at the properties of particular functions, examining their
domains may assist in reducing the number of segments, which reduces the complexity
and delay of the NFG. Domain reduction methods allow the NFG's domain to be shifted
to where it requires fewer segments. However, they often include additional arithmetic
components. Models can be constructed to conduct tradeoff analyses for these domain
reduction methods so that optimal domains can be determined.
APPENDIX A. MATLAB SOURCE CODE


A.1 M-FILE USAGE

In order to use the MATLAB source code, all of the m-files in this appendix are
required to be in the same folder, along with the text files that are imported (Appendix
B.2). When entering commands, or calling functions, MATLAB's current directory must
be the folder where the m-files are stored.

1.  Comparing  Individual  Components 

To  compare  individual  components,  type  the  following  into  MATLAB’s 
command  window: 


[SUP MUP BUP t] = HUandDelay(n, component, w)


This will produce the SUP, MUP, BUP and delay for the given component. The
variable 'component' is a string that matches one of the following strings: 'Adder',
'Mult', 'Mult18x18', 'MUX', 'RAM', 'ROM', 'BRAM', 'BS', 'SIE', 'Mem', 'CLB', or
'SOP'. The values of n and w are the input word width and output word width,
respectively. In some cases, the complexity and delay do not require both inputs. A
summary of all of the components that can be analyzed with HUandDelay is shown in
Table 4.

This  function  can  be  used  to  produce  the  hardware  utilization  and  delays  of 
various  sized  components  for  comparisons.  To  calculate  the  hardware  utilization  in  a 
single  term,  the  HUP  of  a  given  component  can  be  calculated  with  the  following 
command  once  SUP,  MUP,  and  BUP  are  calculated: 


HUPcomp  =  HUP(SUP,  MUP,  BUP) 
2. Comparing NFG Models


The  HUP  and  delay  can  be  found  for  several  NFG  architectures  that  have  been 
implemented  in  models.  The  following  commands  can  be  used  to  compare  various 
models: 


[HUP_comp  t_comp]  =  pickModel(ModelNum,n,s) 


This will return the HUP and delay for an NFG with system size n that requires s
segments. The variable 'ModelNum' can be any integer. Table 11 summarizes the
models that are implemented based on the value of 'ModelNum'.


ModelNum    NFG Model
1           LUB
2           LNB
3           QUB
4           QNB
5           LUC
6           LNC
7           QUC
8           QNC

Table 11 Model Number Index.


3.  Comparing  Functions 


The  HUP  and  delay  can  be  found  for  any  function  over  any  interval.  The  number 
of  segments  must  be  known,  or  the  function  must  meet  the  requirements  for  segment 
estimation  discussed  in  Chapter  IV.  The  functions  and  corresponding  domains  in  Table  7 
may  be  easily  returned  by  calling  the  function  funcSel  with  its  input  variable  equal  to  the 
index  number  of  the  function.  The  following  code  shows  how  to  get  the  HUP  and  delay 
for  a  given  function  on  a  given  interval  with  a  given  system  size  n. 
modelNum = 1    % corresponds to LUB NFG
n = 32          % corresponds to the system size
funcNum = 1     % corresponds to f(x)=2^x on [0,1]

[f a b] = funcSel(funcNum)
numSegs = segments(f,a,b,n)

[HUP_NFG t_NFG] = pickModel(modelNum,n,numSegs(1))

This will produce the HUP and delay for the LUB NFG that realizes f(x) = 2^x on
[a,b]. The variable 'funcNum' chooses the function from the function list, and funcSel
returns the function as a string expression and the domain of the NFG [a,b]. If
'funcNum' is not an integer between 1 and 15, then funcSel prompts the user to input a
function and domain. Any function of x may be entered, if it is recognized as a
single-variable function in MATLAB. The author's function segments returns the number
of segments required in a vector corresponding to the segmentation techniques,
[LU LN QU QN]. To implement a particular model for an NFG, choose the corresponding
number of segments (numSegs(1), numSegs(2), numSegs(3), or numSegs(4)).

4.  Producing  HU-Delay  Graphs 

To produce a HU-Delay Graph to represent an NFG or other arrangement of
components, the user must know the HUP and delay for the components. The user must
also construct a dependency matrix, based on the arrangement of the components, and a
list of component names. Once these are determined, they are input into HUPBoxes with
the following command:

[totHUP totDelay] = HUPBoxes(components, dependency, compNames);


The variable 'components' is a matrix with two columns and a row for every
component in the NFG. The first column holds the HUP value for the component
corresponding to the row number. The second column holds the delay value for that
particular component. The variable 'dependency' is the dependency matrix discussed in
Chapter IV. The variable 'compNames' is an array of strings, where each row holds the
string name for the particular component. Each string (row) must be the same length in
the matrix 'compNames'. The function HUPBoxes will return the total delay along the
worst case path through the NFG and the overall HUP.
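
The idea of combining per-component values into a total HUP and a worst case path delay can be sketched in Python. The dependency convention used here is an assumption made for illustration (entry [i][j] is nonzero when component i takes an input from component j); the convention defined in Chapter IV may differ, and the name `hup_and_worst_delay` is hypothetical.

```python
from functools import lru_cache


def hup_and_worst_delay(components, dependency):
    """components: list of (hup, delay) pairs, one per component.
    dependency[i][j]: truthy if component i waits on component j's output
    (assumed convention; the graph must be acyclic).
    Returns (total_hup, worst_case_delay)."""
    count = len(components)

    @lru_cache(maxsize=None)
    def finish_time(i):
        # A component starts once its slowest predecessor has finished.
        preds = [j for j in range(count) if dependency[i][j]]
        start = max((finish_time(j) for j in preds), default=0.0)
        return start + components[i][1]

    total_hup = sum(hup for hup, _ in components)
    worst_delay = max(finish_time(i) for i in range(count))
    return total_hup, worst_delay


# Illustrative values: two parallel blocks feed a third, which feeds a fourth.
comps = [(0.10, 5.0), (0.20, 3.0), (0.30, 2.0), (0.05, 1.0)]
deps = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
print(hup_and_worst_delay(comps, deps))
```

Hardware adds across all components, while delay adds only along the slowest chain, so parallel components contribute their area but not necessarily their delay.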

A.2  MATLAB  FILES 

1.  M-file  List 

The  following  MATLAB  source  code  was  written  by  the  author.  Table  12  is  the 
list  of  m-files  and  their  dependencies. 


M-file/function                      Depends on

BlackLineStyle                       none

boxesOrigin                          BlackLineStyle

fillLin                              none

funcSel                              none

HUP                                  none

HUPBoxes                             none

HUandDelay                           LoadXilinxDeviceData
                                     HUP
                                     fillLin
                                     HUandDelay (Recursion)
                                     IMPORTS data from: MultDelayWithNet.txt
                                                        MultSlices.txt
                                                        MuxDelayWithNet.txt

LoadXilinxDeviceData                 fillLin
                                     IMPORTS data from: NetDelay.txt

model_Linear_NonUniform_Basic        HUandDelay
model_Linear_NonUniform_Compact      HUP
model_Linear_Uniform_Basic           HUPBoxes
model_Linear_Uniform_Compact         totalHUPandDelay
model_Quad_NonUniform_Basic
model_Quad_NonUniform_Compact
model_Quad_Uniform_Basic
model_Quad_Uniform_Compact

myInt                                none

pickModel                            model_Linear_NonUniform_Basic
                                     model_Linear_NonUniform_Compact
                                     model_Linear_Uniform_Basic
                                     model_Linear_Uniform_Compact
                                     model_Quad_NonUniform_Basic
                                     model_Quad_NonUniform_Compact
                                     model_Quad_Uniform_Basic
                                     model_Quad_Uniform_Compact

segments                             myInt
                                     symbolic\syms.m

totalHUPandDelay                     none

Table 12 M-file List with Dependencies.
2. M-file Source Codes


FILE: BlackLineStyle.m


function [styleCode] = BlackLineStyle(index)

% This function returns a string variable to be used as a line style
% Written by Tim Knudstrup, August 30, 2007

index = round(abs(index));   % ensures non-negative integer

numStyles = 9;

% Wrap into 1..numStyles so that every case below, including case 9, is reachable
index = mod(index-1, numStyles) + 1;


switch index
    case 1
        styleCode = 'k-';
    case 2
        styleCode = 'k--';
    case 3
        styleCode = 'k-.';
    case 4
        styleCode = 'k:';
    case 5
        styleCode = 'k.:';
    case 6
        styleCode = 'k.-';
    case 7
        styleCode = 'k+-.';
    case 8
        styleCode = 'k*:';
    case 9
        styleCode = 'k*-';
    otherwise
        styleCode = 'k-';
end


FILE: boxesOrigin.m


function [a] = boxesOrigin(s, t)

% boxesOrigin.m
%
% This function/program plots HU-Delay Graph for various components
% and each component is centered at the origin.
%
% function [a] = boxesOrigin(s, t)
%
% Input:    s: Vector containing Size values
%           t: Vector containing Time Delay values
% Output:   a: Returns 1 if no error
%
% Comments: s and t must be the same length
%
% Created by: Tim Knudstrup
% Date: 20 September 2007

%s=[2 3 4 5 6 8];
%t=[10 2 3 4 5 12];

inc = 0.01;

tAxisLength = max(t)+1;
sAxisLength = max(s)+1;

tAxis = [0:inc:tAxisLength];

NumComps = max(size(t));
t_len = max(size(tAxis));
sizeMatrix = zeros(NumComps, t_len);

for comp = 1:NumComps
    tcum(comp) = tAxisLength-sum(t(comp+1:end));
end

close all;
figure(1)
comp = 1;

% Each component is drawn as a box of height s(comp) from t = 0 to t(comp)
for comp = 1:NumComps
    for k = 1:(t_len)
        tVal = k*inc;
        if tVal <= t(comp)
            sizeMatrix(comp,k) = s(comp);
        end
    end
end

for p = 1:NumComps
    p
    colr = ([rand(1) rand(1) rand(1)]).^1.5;
    plot(tAxis, sizeMatrix(p,:), BlackLineStyle(p))
    hold on
end

hold off

axis([0 tAxisLength 0 max(s)*1.2]);
legend

ylabel('HUP (%)');
xlabel('Delay (ns)');

print -depsc -tiff BoxesOrigin.eps

a = 1;
FILE: fillLin.m


function [filledX filledY] = fillLin(dataX, dataY)

% fillLin.m
%
% This function creates filledX and filledY vectors containing data at
% every integer ranging from 1 to the maximum integer value of dataX.
% The values in filledY match those in the original dataY, and for data
% points not included in dataX, filledY values are estimated using
% linear approximation between the data points that do exist.
%
% function [filledX filledY] = fillLin(dataX, dataY)
%
% Input:    dataX: X values for data points
%           dataY: Y values for data points
% Output:   filledX: X values from 1 to max dataX
%           filledY: Y values corresponding to filledX
%
% Comments: 1. dataX must be positive integers only
%           2. dataX must be the same length as dataY
%
% Created by: Tim Knudstrup
% Date: 20 September 2007

% Trial DATA
%dataX = [1 2 5 9 20];
%dataY = [3 4 6 8 10];

dataX = round(dataX);   % makes sure all x values are integers

unit = 1;

filledX = [1:unit:max(dataX)];

len = length(filledX);
lenData = length(dataX);
dummy = 123456789;                 % marks entries that still need a value
filledY = dummy*(0*filledX+1);

filledY(1) = dataY(1);

for k = 1:lenData
    filledY(dataX(k)) = dataY(k);
end
k = 1;

beginIndex = 1;
endIndex = 1;

while (k < len) && (beginIndex<len) && (endIndex<len)
    % advance to the first unfilled entry
    while (filledY(beginIndex) ~= dummy) && (beginIndex<len)
        beginIndex = beginIndex+1;
    end

    endIndex = beginIndex;

    % advance to the next filled entry
    while (filledY(endIndex) == dummy) && (endIndex<len)
        endIndex = endIndex + 1;
    end

    if filledY(beginIndex)==dummy
        if beginIndex > 1
            m = (filledY(endIndex)-filledY(beginIndex-1))/ ...
                (filledX(endIndex)-filledX(beginIndex-1));
            b = filledY(beginIndex-1)-filledX(beginIndex-1)*m;
        end

        for kk = beginIndex:endIndex
            filledY(kk) = filledX(kk)*m+b;
        end
    end
    k = k+1;

    beginIndex = endIndex+1;
end

filledX = filledX';
filledY = filledY(1:len)

%plot(dataX, dataY, filledX, filledY)


FILE:  funcSel.m 

function [f a b] = funcSel(funcNum)

%  This function returns the string representing the function
%  and its domain for one of the functions in the function suite.
%  The input variable 'funcNum' is the index of the function in the
%  function suite.
%  If funcNum is not an integer between 1 and 15, then the user is
%  prompted for an equation and domain.

switch funcNum
    case 1
        f='2^x';
        a = 0;
        b = 1;
    case 2
        f='1/x';
        a = 1;
        b = 2;
    case 3
        f='sqrt(x)';
        a = 1;
        b = 2;
    case 4
        f='1/sqrt(x)';
        a = 1;
        b = 2;
    case 5
        f='log2(x)';
        a = 1;
        b = 2;
    case 6
        f='log(x)';
        a = 1;
        b = 2;
    case 7
        f='sin(pi*x)';
        a = 0;
        b = 0.5;
    case 8
        f='cos(pi*x)';
        a = 0;
        b = 0.5;
    case 9
        f='tan(pi*x)';
        a = 0;
        b = 0.25;
    case 10
        f='sqrt(-log(x))';
        a = 1/512;
        b = 1/4;
    case 11
        f='(tan(pi*x))^2+1';
        a = 0;
        b = 0.25;
    case 12
        f='0-(x*log2(x)+(1-x)*log2(1-x))';
        a = 1/256;
        b = 1-1/256;
    case 13
        f='1/(1+exp(-x))';
        a = 0;
        b = 1;
    case 14
        f='1/(sqrt(2*pi))*exp(-x^2/2)';
        a = 0;
        b = sqrt(2);
    case 15
        f='sin(exp(x))';
        a = 0;
        b = 2;
    otherwise
        f=input(' Enter function string (ie ''e^x''): ');
        a = input(' Enter beginning of interval: ');
        b = input(' Enter end of interval: ');
end
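
For readers who want to evaluate the function suite outside MATLAB, the table in funcSel.m can be restated as a lookup from index to (function, a, b). This is a sketch under assumptions: the names FUNCTION_SUITE and sample are hypothetical, the functions are stored as Python callables rather than strings, and there is no interactive fallback, so an unknown index simply raises KeyError.

```python
import math

# index -> (callable, interval start a, interval end b), following funcSel.m
FUNCTION_SUITE = {
    1: (lambda x: 2.0 ** x, 0.0, 1.0),
    2: (lambda x: 1.0 / x, 1.0, 2.0),
    3: (math.sqrt, 1.0, 2.0),
    4: (lambda x: 1.0 / math.sqrt(x), 1.0, 2.0),
    5: (math.log2, 1.0, 2.0),
    6: (math.log, 1.0, 2.0),  # MATLAB log is the natural logarithm
    7: (lambda x: math.sin(math.pi * x), 0.0, 0.5),
    8: (lambda x: math.cos(math.pi * x), 0.0, 0.5),
    9: (lambda x: math.tan(math.pi * x), 0.0, 0.25),
    10: (lambda x: math.sqrt(-math.log(x)), 1 / 512, 1 / 4),
    11: (lambda x: math.tan(math.pi * x) ** 2 + 1, 0.0, 0.25),
    12: (lambda x: -(x * math.log2(x) + (1 - x) * math.log2(1 - x)),
         1 / 256, 1 - 1 / 256),
    13: (lambda x: 1.0 / (1.0 + math.exp(-x)), 0.0, 1.0),
    14: (lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi),
         0.0, math.sqrt(2)),
    15: (lambda x: math.sin(math.exp(x)), 0.0, 2.0),
}


def sample(func_num, points):
    """Evaluate suite function func_num at evenly spaced points on [a, b]."""
    if points < 2:
        raise ValueError("points must be at least 2")
    f, a, b = FUNCTION_SUITE[func_num]
    step = (b - a) / (points - 1)
    return [f(a + i * step) for i in range(points)]
```

For example, sample(2, 2) evaluates 1/x at both ends of its domain [1, 2].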

FILE: HUandDelay.m

function [SUP MUP BUP delay] = HUandDelay(n,device,WordWidth)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  HUandDelay.m
%
%  This function returns hardware utilization parameters and propagation
%  delay estimates for several arithmetic logic devices for a given word
%  size n.  It does not always return the best-case circuit design,
%  but illustrates the effects of word width on the size and delay of
%  basic arithmetic logic circuits.
%
%  function [SUP MUP BUP delay] = HUandDelay(n,device,WordWidth)
%
%  Input:   n:          the wordsize of the arithmetic device
%           device:     string value for the type of logic device.  It
%                       may be one of the following devices:
%                  1    'Adder' for an adder
%                  2    'Mult' for multiplier built from CLBs
%                  3    'MULT18x18' for a multiplier using MULT18x18s
%                  3    'MUX' or 'mux' for a multiplexer
%                  4    'RAM', 'ROM', 'DistRAM' for memory devices
%                  5    'CLB' for general n-input logic function
%                  6    'BRAM' or 'BlockRAM' for Block RAM memory
%                  7    'BS' or 'BarrelShifter' for a BarrelShifter
%                  8    'SIE' for a segment index encoder (LUT Cascade)
%                  9    'MEM' or 'Mem' picks the best from ROM or BRAM
%                  10   'SOP' for a worst case SOP with n variables
%           WordWidth:  the number of output bits
%                       (used for MEM, BRAM and CLB only)
%
%  Output:  SUP:    Slice Utilization Percentage
%           MUP:    MULT18x18 Utilization Percentage
%           BUP:    BRAM Utilization Percentage
%           delay:  propagation delay for the logic device
%
%  Comments:
%
%  Created by:  Tim Knudstrup
%  Date:        13 October 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%  loads the Hardware Specifications for the Xilinx Virtex-II XC2V6000
LoadXilinxDeviceData;

WordWidth = ceil(WordWidth);

%************ CALCULATING AREA USED *******************

switch device  % DistRAM assumes SinglePort (Dual Port is twice as much
               % space)

    case {'CLB','ROM','DistRAM','Rom','LUT'}
        %  ROMs are constructed from Xilinx Primitive RAMs, using read time
        %  delays from the Address input bits to the Data output bit.
        %  Maximum distributed RAM primitive is 128x1, or 7 address bits.
        %  Thus, if n > 7, larger ROMs are constructed using 2^(n-7) 128x1
        %  ROMs, combined with a 2^(n-7):1 MUX network.  For large n > 14,
        %  Block RAM should be used to avoid using up all of the CLBs.
        fanout=ceil(2^(n-4))*WordWidth;  % also accounts for the fanout
                                         % inside each ROM unit
        if fanout > 129
            fanout = 129;
        elseif fanout < 1
            fanout = 1;
        end
        RomPrim=n;  % index into a single Nx1 ROM where N is at most 7
        if RomPrim > 7
            RomPrim=7;
        end
        if n > 0
            ROMdelay=tNx1ROM(RomPrim);  % delay of a single Nx1 ROM (where n <= 7)
        else
            ROMdelay=0;
        end
        NumMuxLevels=n-7;
        if NumMuxLevels < 0
            NumMuxLevels = 0;
        end
        NumROMs=2^NumMuxLevels;
        LUTSperROM=ceil(2^(RomPrim-4));
        SUPperROM=100*LUTSperROM/2/TotalSlices;
        [SUP_MUX MUP_MUX BUP_MUX tMUX] = ...
            HUandDelay(NumROMs,'MUX',WordWidth);
        if NumMuxLevels == 0
            tMUX = 0;
            SUP_MUX=0;
        end
        SUP=(NumROMs*SUPperROM+SUP_MUX)*WordWidth;
        BUP=0;
        MUP=0;
        delay = tNET(fanout)+ROMdelay+tNET(1)+tMUX;
        if n<=0
            SUP=0;
            delay=0;
        end

    case {'BlockRAM','BRAM'}
        k=ceil(n);  % k is defined in thesis as the number of address lines
        NumMemLocations = 2^k;
        ReqMemBits = NumMemLocations*WordWidth;
        NumBlocks=ceil(ReqMemBits/MemBitsPerBRAM);
        fanout = NumBlocks;
        if fanout>128
            fanout = 128;
        end
        if fanout < 1
            fanout = 1;
        end
        MuxLevels=k-14;
        if MuxLevels <= 0
            MuxLevels=0;
            SUP=0;
            MuxDelay=0;
        else
            [SUP_MUX MUP_MUX BUP_MUX MuxDelay] = ...
                HUandDelay(2^MuxLevels,'MUX',WordWidth);
            SUP=SUP_MUX*WordWidth;
        end
        MUP=0;
        BUP=100*NumBlocks/NumBlockRAM;
        delay = tNET(fanout) + tBCKO + MuxDelay;  % clk-->data out plus Setup time

    case {'MEM','Mem'}
        %  Uses the type of memory that requires the least hardware (HUP)
        [SUP_BRAM MUP_BRAM BUP_BRAM tBRAM] = ...
            HUandDelay(n,'BRAM',WordWidth);
        HUP_BRAM=HUP(SUP_BRAM,MUP_BRAM,BUP_BRAM);
        [SUP_LUT MUP_LUT BUP_LUT tLUT] = HUandDelay(n,'LUT',WordWidth);
        HUP_LUT=HUP(SUP_LUT,MUP_LUT,BUP_LUT);
        if (HUP_LUT > HUP_BRAM)
            BUP=BUP_BRAM;
            MUP=0;
            SUP=SUP_BRAM;
            delay=tBRAM;
        else
            BUP=BUP_LUT;
            MUP=0;
            SUP=SUP_LUT;
            delay=tLUT;
        end


    case 'ExtRAM'  % NOT CONFIGURED AT THIS TIME
        %  use Address Decoder NumLUTs
        DeviceCLBs = xxx;
        delay = xxx;

    case 'SOP'
        %  This assumes a worst case SOP realization
        numTerms = 2^(n-1)*WordWidth;
        termSize=n;
        fanout=WordWidth*2^(n-1);
        if fanout>128
            fanout = 128;
        end
        if fanout < 1
            fanout = 1;
        end
        numSlices = numTerms*ceil(termSize/4)/2;
        SUP = 100*numSlices/TotalSlices;
        BUP=0;
        MUP=0;
        delay = tNET(fanout)+tLUT4+tMUXCY_S_0+(ceil(termSize/4)-1)*tMUXCY_I_0+(numTerms)*tORCY;


    case 'Mult18x18'
        %  Imported Data removes I/O Buffer gate delay, but leaves in tNET
        %  Estimates for multipliers are from empirical data.
        maxRadix=17;  % r is the radix of the multiplier
        nOVERr=ceil(n/maxRadix);
        numPPbits=ceil(n/nOVERr);  % This finds the number of bits of the PPs
        PPGoutputBit=numPPbits*2;  % This is index into multiplier delays for a
                                   % given pin on the MULT18x18, which is
                                   % twice the number of bits in the
                                   % multiplicands into the MULT18x18.
        mult = importdata('MultDelayWithNet.txt');
        [MULTn MULTt] = fillLin(mult(:,1),mult(:,2));
        fanout=nOVERr;
        NumMults=nOVERr^2;
        mult = importdata('MultSlices.txt');
        [MULTn MULTslice] = fillLin(mult(:,1),mult(:,2));
        NumSlices=MULTslice(n);
        SUP=100*NumSlices/TotalSlices;
        MUP=100*NumMults/Num18x18;
        BUP=0;
        delay = MULTt(n);

    case 'Mult'
        %  Estimations based on architecture using CLBs
        Radix=4;
        nOVERr=ceil(n/Radix);
        %SlicesPerPPG=4;  % This assumes 8 4-input LUTs are used for each PPG
        fanout=nOVERr;
        NumPPGs=nOVERr^2;
        NumAdders = 2*(nOVERr-1)*nOVERr+1;
        AdderDepth = 2*(nOVERr-1);
        %  Assumes each PPG is built from a Radix-bit function
        [SUPperPPG MUP_PPG BUP_PPG PPGdelay] = ...
            HUandDelay(Radix,'CLB',WordWidth);
        SUPperPPG = SUPperPPG * 2*Radix;  % Each PPG requires 2*Radix functions
        %  Each Adder is assumed to be a Radix-bit adder
        [SUPperAdder MUP_Adder BUP_Adder AdderDelay] = ...
            HUandDelay(Radix,'Adder',WordWidth);
        SUP=NumPPGs*SUPperPPG+SUPperAdder*NumAdders;
        MUP=0;
        BUP=0;
        %  Adders are assumed to occur in series (NOT the best design)
        delay = PPGdelay+AdderDelay*AdderDepth;

    case 'Adder'
        %  Imported data is not utilized for adders, since a linear equation
        %  can be shown empirically to fit the Xilinx ISE data.
        NumSlices=ceil(n/2);
        tRCA_overhead=2.528;
        %  after analyzing XILINX ISE data, linear equation works for n>4
        %  error in linear approximation is 0 for n > 4
        if n <= 2
            delay = tILO+tNET(1);
        elseif n <= 3
            delay = tIFX+tNET(1);
        elseif n <= 4
            delay = 2*tIF5+tNET(1);
        else
            delay = tMUXCY_I_0*(n-2) + tRCA_overhead;
        end
        SUP = 100*NumSlices/TotalSlices;
        MUP=0;
        BUP=0;


    case {'BS','BarrelShifter'}
        %  uses n n:1 Muxs as most basic Barrel Shifter
        [SUP_MUX MUP_MUX BUP_MUX MuxDelay] = HUandDelay(n,'MUX',WordWidth);
        fanout = n;
        shiftLevels=ceil(log2(n));
        SUP=shiftLevels*SUP_MUX;
        MUP=0;
        BUP=0;
        delay = MuxDelay+tNET(fanout)-tNET(1);
        %  removes tNET for fanout of 1 and inserts tNET for appropriate fanout

    case {'MUX','mux','Mux'}
        %  This is a n:1 MUX
        NumSlices=ceil(n/4);  % checks with ISE data
        mux = importdata('MuxDelayWithNet.txt');
        [MUXn MUXt] = fillLin(mux(:,1),mux(:,2));
        %  Imported Data removes I/O Buffer gate delay, but leaves in tNET
        %  Max n to index into MUXt is 128
        if n <= 128
            delay = MUXt(n);  % delay comes from imported ISE data
        else
            delay = 2*ceil(log2(n))-14+12.1997;  % estimate from equations
        end
        if n<=2
            delay=tNET(1)+tILO;
        end
        SUP=100*NumSlices/TotalSlices;
        MUP=0;
        BUP=0;

    case 'SIE'
        %  SIE is assumed to be for NON-UNIFORM Segmentation
        %  The SIE is constructed with a LUT cascade architecture.
        %  The timing and HW utilization are based on the architectural
        %  description in the thesis, where the number of address lines
        %  input to the memory is the WordWidth.
        k=WordWidth;
        numRails = k;
        [SUP_LUT MUP_LUT BUP_LUT LUTDelay] = ...
            HUandDelay(k+2,'LUT',WordWidth);
        %  EACH LUT is a (k+2)-input LUT with k outputs --> k (k+2)-input LUTs
        %  are used in series.  The HUP using LUTs is compared to the HUP
        %  using BRAMs and the one using less hardware is chosen.
        [SUP_BRAM MUP_BRAM BUP_BRAM BRAMDelay] = ...
            HUandDelay(k+2,'BRAM',WordWidth);
        HUP_LUT=HUP(SUP_LUT,MUP_LUT,BUP_LUT);
        HUP_BRAM=HUP(SUP_BRAM,MUP_BRAM,BUP_BRAM);
        if HUP_LUT > HUP_BRAM
            SUP=SUP_BRAM;
            BUP=BUP_BRAM;
            MUP=MUP_BRAM;
        else
            SUP=SUP_LUT;
            BUP=BUP_LUT;
            MUP=MUP_LUT;
        end
        SUP=SUP*ceil((n-k)/2);
        BUP=BUP*ceil((n-k)/2);
        MUP=MUP*ceil((n-k)/2);
        delay=LUTDelay*ceil((n-k)/2);

    otherwise
        SUP = 'ERROR';
        BUP = 'ERROR';
        MUP = 'ERROR';
        delay = 'ERROR';
end
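
The Block RAM sizing arithmetic from the 'BRAM' case of HUandDelay.m can be restated on its own. The sketch below is illustrative only: the function name bram_usage is hypothetical, the two constants are the MemBitsPerBRAM and NumBlockRAM values listed in LoadXilinxDeviceData.m, and the fanout clamp and delay terms are left out because they depend on the imported net-delay data.

```python
import math

MEM_BITS_PER_BRAM = 16384  # MemBitsPerBRAM in LoadXilinxDeviceData.m
NUM_BLOCK_RAM = 144        # NumBlockRAM in LoadXilinxDeviceData.m


def bram_usage(address_bits, word_width):
    """Return (blocks needed, output MUX levels, BRAM utilization in %).

    Hypothetical restatement of the sizing steps in the 'BRAM' case.
    """
    k = math.ceil(address_bits)                 # number of address lines
    required_bits = (2 ** k) * math.ceil(word_width)
    num_blocks = math.ceil(required_bits / MEM_BITS_PER_BRAM)
    mux_levels = max(k - 14, 0)                 # MUX levels only past 14 address lines
    bup = 100.0 * num_blocks / NUM_BLOCK_RAM
    return num_blocks, mux_levels, bup
```

As a usage example, bram_usage(15, 8) describes a 32768-word, 8-bit table: it needs 16 blocks and one level of output multiplexing.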


FILE: HUP.m

function [HUPout] = HUP(SUP,MUP,BUP)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  HUP.m
%
%  This function calculates the Hardware utilization percentage
%
%  function [HUPout] = HUP(SUP,MUP,BUP)
%
%  Input:   SUP      slice utilization percentage in %, max 100%
%           MUP      MULT18x18 Utilization Percentage, max 100%
%           BUP      BRAM Utilization Percentage, max 100%
%
%  Output:  HUPout:  Calculated value for HUP.
%
%  Created by:  Tim Knudstrup
%  Date:        12 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


BS=0;
if BS == 1
    x=[1:1:100];
    SUP=[1:1:100]/100;
    MUP=[[1:2:100] [1:2:100]]/100;
    BUP=[[1:4:100] [1:4:100] [1:4:100] [1:4:100]]/100;
    HUPa = 1-((1-SUP).*(1-MUP).*(1-BUP)).^(1/3);  %./sqrt((1-SUP).*(1-MUP).*(1-BUP));
    HUPb=(SUP.*MUP.*BUP).^(1/3);
    close all;
    plot(x,SUP,x,MUP,x,BUP,x,HUPa,x,HUPb)
    legend('SUP','MUP','BUP','HUPa','HUPb')
    axis([0 100 0 1])
end

if SUP > 100
    SUP = 100;
end

if MUP > 100
    MUP = 100;
end

if BUP > 100
    BUP = 100;
end

HUPout=100*(1-((1-SUP/100)*abs(1-MUP/100)*abs(1-BUP/100))^(1/3));
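
The final line of HUP.m combines the three utilization percentages through the geometric mean of the unused fractions. A minimal Python sketch of that formula follows; the function name hup is hypothetical, and the inputs are assumed to be non-negative percentages.

```python
def hup(sup, mup, bup):
    """Combined hardware utilization percentage, following HUP.m.

    Each input is clipped to 100 %, then the result is
    100 * (1 - cube root of the product of the unused fractions).
    """
    sup, mup, bup = (min(v, 100.0) for v in (sup, mup, bup))
    unused = (1 - sup / 100) * abs(1 - mup / 100) * abs(1 - bup / 100)
    return 100.0 * (1 - unused ** (1 / 3))
```

Two properties are easy to check by hand: when all three resources are used equally the result equals that common percentage, and when any one resource is fully used the result is 100 %.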


FILE: HUPBoxes.m

function [totHUP totalDelay] = HUPBoxes(components,dependence,compNames)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  HUPBoxes.m
%
%  This function/program displays the delay and percent hardware
%  utilization given up to 12 components and a dependence relationship.
%  It is used to show circuit components in series and in parallel
%  and the combined delay of multiple components, which depends on
%  one component's relationship to another.
%
%  function [totHUP totalDelay] = HUPBoxes(components,dependence,compNames)
%
%  Input:   components:  nx2 array of components arranged
%                        n = row number = the component number
%                        Max number of ROWs is 12
%                        each row contains:
%                        [ HUP timedelay ]
%           dependence:  an nxn array that defines the dependence
%                        of the components.
%                        For each row, the array should contain a 1 if
%                        the component number (row#) has to wait until
%                        another component is completed (in series).
%           compNames:   an nx1 column of strings, naming each component
%                        strings must be the same length, can add extra
%                        spaces.
%
%  Output:  totHUP:      total percent of hardware used in this circuit
%           totalDelay:  total composite circuit delay
%
%  Comments:
%
%  Created by:  Tim Knudstrup
%  Date:        12 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

numComps=size(components);
numComps=numComps(1);

close all;

%  Color list (each Row contains a different color code (up to 12))
Clist = [ 0.5   0     0
          0     0     0.5
          0     0.5   0
          0.5   0.5   0
          0.5   0     0.5
          0     0.5   0.5
          0.75  0     0
          0     0     0.75
          0     0.75  0
          0.75  0.75  0
          0.75  0     0.75
          0     0.75  0.75];

compEnds=zeros(1,numComps);
compStarts=compEnds;
compTop=compEnds;
compBot=compEnds;

for comp=1:numComps
    if (sum(dependence(comp,:))==0)
        compStarts(comp)=0;
    else
        compDep=find(dependence(comp,:));
        compStarts(comp)=max(compEnds(compDep));
    end
    compEnds(comp)=compStarts(comp)+components(comp,2);
end

compStarts;
compEnds;

for comp = 1:numComps
    if (comp==1)
        compBot(comp)=0;
    else
        sameStart=find(compStarts(1:comp-1)==compStarts(comp));
        if isempty(sameStart)
            compDep=find(dependence(comp,:));
            [y indx] = max(compEnds(compDep));  % finds index into
            compBot(comp)=compBot(indx);
        else
            largestTop=max(sameStart);
            compBot(comp)=compTop(largestTop);
        end
    end
    compTop(comp)=compBot(comp)+components(comp,1);
end

compBot;
compTop;

%  OUTPUT Data
totalDelay=max(compEnds);
totHUP=sum(components(:,1));

%  Graphs
for comp = 1:numComps
    xVals=[compStarts(comp) compStarts(comp) compEnds(comp) compEnds(comp)];
    yVals=[compBot(comp) compTop(comp) compTop(comp) compBot(comp)];
    colorset=Clist(comp,:);
    fill(xVals,yVals,colorset)
    hold on
end

legend(compNames,'Location','EastOutside')
ylabel('Hardware Utilization Percentage')
xlabel('Propagation delay (ns)')
temp=cat(2,'Total HUP = ',num2str(totHUP,4),' %, Total Propagation Delay = ',num2str(totalDelay,4),' ns.');
title(temp)
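
The timing part of HUPBoxes.m (start and end time of each component from the dependence matrix) can be separated from the plotting. The Python sketch below is an illustration under assumptions: the name schedule is hypothetical, and, like the MATLAB loop, it assumes every component is listed after the components it depends on.

```python
def schedule(components, dependence):
    """Compute start/end times and totals from a dependence matrix.

    components: list of (hup, delay) pairs, one per component.
    dependence: square 0/1 matrix; dependence[i][j] == 1 means
                component i waits for component j to finish.
    Assumes rows are ordered so that dependencies come first.
    """
    n = len(components)
    starts = [0.0] * n
    ends = [0.0] * n
    for i in range(n):
        preds = [j for j in range(n) if dependence[i][j]]
        # A component starts when its slowest predecessor has finished.
        starts[i] = max((ends[j] for j in preds), default=0.0)
        ends[i] = starts[i] + components[i][1]
    total_delay = max(ends)
    total_hup = sum(c[0] for c in components)
    return starts, ends, total_hup, total_delay
```

With three components in series the total delay is the sum of the three delays; if the first two run in parallel, the third starts when the slower of the two finishes.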


FILE: LoadXilinxDeviceData.m
%************ XILINX Virtex-II 6000 Limits *************

%  Most data originates from Virtex-II Platform FPGA Datasheet (available at
%  www.xilinx.com) assuming a Virtex-II XC2V6000 device with a speed grade of
%  -4 (worst case).
%  Data collected through simulation is noted.
%  All Delay data included here is the worst case input to output signal
%  delay for the particular device.

xxx=123456789;  % This value is not known at this time

%******** Available Memory **************

%  ***** Distributed SelectRAM ***
%  I am really only concerned with ROM
TotalDistRAM = 132000;
TotalDistRAMbits = 1081344;
TotalDistRAMbytes = TotalDistRAMbits/8;
t_AS = 0.5;
DistRAMDelay = xxx;  % in ns
tSHCKO16 = 2.05;
tSHCKO32 = 2.49;
tSHCKOF5 = 2.23;

%  ***** Block SelectRAM *********
NumBlockRAM = 144;
TotalBlockRAM = 324000;
TotalBlockRAMbits = 2654208;
TotalBlockRAMbytes = TotalBlockRAMbits/8;
MemBitsPerBRAM = 16384;
BlockRAMdelay = 2.65;  % in ns
tBCKO = 2.65;
tBACK = 0.36;

%  ******* ROM ******************
%  Uses CLB directly as a function of n inputs
%  Thus all data is empirically determined from Xilinx ISE Primitives and
%  does not include net delays or IO Buffer delays.
%  The delays are combinational along the longest delay path from
%  Address bit A0 to the data output.  All times are in ns.
%  The primitives are actually RAM units, but only used as ROMs.
%  These values do not include NET delays; they must be accounted for
%  elsewhere
tNx1ROM=[0.875 0.875 0.875 0.875 0.875 0.875 1.562 1.879];
t16x1ROM=0.875;
t32x1ROM=0.875;
t64x1ROM=1.562;
t128x1ROM=1.879;

%***************************************
%******* Available Logic **************
TotalSlices=33792;
TotalLUTs=67584;
TotalFFs=67584;
TotalShiftRegBits=TotalDistRAMbits;
MaxSOPChain=192;
MaxCarryChain=176;

%****** CLB ***************************
TotalCLBs=TotalLUTs/8;
CLBdelay4to1 = 0.44;  % SPEED GRADE -4 in ns
CLBdelay5to1 = 0.72;
tILO=0.44;
tIF5=0.72;
tIFX=0.95;
tINFXY=0.45;
tINAFX=0.32;
tINBFX=0.32;
tSOPSOP=0.44;

%  MORE DATA AVAILABLE

%***************************************
%****** Multipliers ********************
%  Check to see if Enhanced or not !!!
Num18x18=144;

%  These are the worst case in to out delays using the entire multiplier
Delay18x18=10.36;  % in ns
Delay18x18Enh=5.91;  %

%  The DELAY can be reduced if the entire 18x18 Mult is not used
%  See Page 22 of Module 3 in Xilinx Datasheet
%  Index into the array is offset by 1
tMULT = [3.12;3.32;3.53;3.74;3.94;4.15;4.36;4.56;
         4.77;4.98;5.19;5.39;5.6;5.81;6.01;6.22;6.43;6.63;
         6.84;7.05;7.26;7.46;7.67;7.88;8.08;8.29;8.5;8.7;
         8.91;9.12;9.33;9.53;9.74;9.95;10.15;10.36];

%***************************************
%****** Routing Delays *****************
tIBUF=0.825;
tOBUF=4.361;

%***************************************
%****** I/O Pads ***********************
TotalIOpads=1104;
IOpadDelay = 100;  % in ns

%***************************************

%  EMPIRICAL DATA COLLECTED
%  This creates an array tNET from empirical data supported by Xilinx
%  Datasheets.

%  ***** NET DELAYS ********
data_in = importdata('NetDelay.txt');
[fanout tNET] = fillLin(data_in(:,1),data_in(:,2));

%  **** SPECIAL MUX DELAYS **
tMUXCY_I_0 = 0.053;  % fast carry MUX prop delay from input I0 to output
                     % This data is reported to be 0.05 ns in Datasheet.
tMUXCY_S_0 = 0.298;

%  **** LOGIC COMPONENT DELAYS
tLUT4 = 0.439;
tORCY = 0.44;

%***************************************

plot_on=0;
if plot_on == 1
    stem(data_in(:,1),data_in(:,2),'bo-')
    hold on
    plot(fanout,tNET,'g.-')
    xlabel('fanout')
    ylabel('Net Delay')
    legend('Collected Data Points','FillLine Data Points');
end


FILE: model_Linear_NonUniform_Basic.m

function [totHUP totDelay] = model_Linear_NonUniform_Basic(n,numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  model_Linear_NonUniform_Basic.m
%
%  This function produces the HUP and delay for a model of a linear NFG
%  using nonuniform segmentation.
%
%  function [totHUP totDelay] = model_Linear_NonUniform_Basic(n,numSegs)
%
%  Input:   n:         number of bits in the system
%           numSegs:   number of segments in the memory
%
%  Output:  totHUP:    hardware utilization percentage
%           totDelay:  total composite circuit delay
%
%  Comments:
%
%  Created by:  Tim Knudstrup
%  Date:        25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

k=ceil(log2(numSegs));  % number of address lines to the coefficients Memory
WordWidth=2*n;          % 2 n-bit numbers are stored in the Coefficients Memory

[SUP_SIE MUP_SIE BUP_SIE tSIE] = HUandDelay(n,'SIE',k);
[SUP_mult MUP_mult BUP_mult tMult] = HUandDelay(n,'Mult18x18',WordWidth);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k,'MEM',WordWidth);
[SUP_add MUP_add BUP_add tAdd] = HUandDelay(2*n,'Adder',WordWidth);

HUP_SIE = HUP(SUP_SIE,MUP_SIE,BUP_SIE);
HUP_mult = HUP(SUP_mult,MUP_mult,BUP_mult);
HUP_mem = HUP(SUP_mem,MUP_mem,BUP_mem);
HUP_add = HUP(SUP_add,MUP_add,BUP_add);

device1 = [HUP_SIE tSIE];
device2 = [HUP_mem tMem];
device3 = [HUP_mult tMult];
device4 = [HUP_add tAdd];

dependency = [0 0 0 0
              1 0 0 0
              0 1 0 0
              0 0 1 0];

components = [device1;device2;device3;device4];
compNames = ['SIE        '
             'Memory     '
             'Multiplier '
             'Adder      '];

graphON=0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components,dependency,compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components,dependency,compNames);
end


FILE: model_Linear_Uniform_Basic.m

function [totHUP totDelay] = model_Linear_Uniform_Basic(n,numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  model_Linear_Uniform_Basic.m
%
%  This function produces the HUP and delay for a model of a linear NFG
%  using uniform segmentation.
%
%  function [totHUP totDelay] = model_Linear_Uniform_Basic(n,numSegs)
%
%  Input:   n:         number of bits in the system
%           numSegs:   number of segments in the memory
%
%  Output:  totHUP:    hardware utilization percentage
%           totDelay:  total composite circuit delay
%
%  Comments:
%
%  Created by:  Tim Knudstrup
%  Date:        25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

k=ceil(log2(numSegs));  % number of address lines to the coefficients Memory
WordWidth=2*n;          % 2 n-bit numbers are stored in the Coefficients Memory

[SUP_mult MUP_mult BUP_mult tMult] = HUandDelay(n,'Mult18x18',WordWidth);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k,'MEM',WordWidth);
[SUP_add MUP_add BUP_add tAdd] = HUandDelay(2*n,'Adder',WordWidth);

HUP_mult = HUP(SUP_mult,MUP_mult,BUP_mult);
HUP_mem = HUP(SUP_mem,MUP_mem,BUP_mem);
HUP_add = HUP(SUP_add,MUP_add,BUP_add);

device1 = [HUP_mem tMem];
device2 = [HUP_mult tMult];
device3 = [HUP_add tAdd];

dependency = [0 0 0
              1 0 0
              0 1 0];

components = [device1;device2;device3];
compNames = ['Memory     '
             'Multiplier '
             'Adder      '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components,dependency,compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components,dependency,compNames);
end


FILE: model_Linear_NonUniform_Compact.m

function [totHUP totDelay] = model_Linear_NonUniform_Compact(n,numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  model_Linear_NonUniform_Compact.m
%
%  This function produces the HUP and delay for a model of a compact
%  linear NFG using nonuniform segmentation.
%
%  function [totHUP totDelay] = model_Linear_NonUniform_Compact(n,numSegs)
%
%  Input:   n:         number of bits in the system
%           numSegs:   number of segments in the memory
%
%  Output:  totHUP:    hardware utilization percentage
%           totDelay:  total composite circuit delay
%
%  Comments:
%
%  Created by:  Tim Knudstrup
%  Date:        25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

k=ceil(log2(numSegs));  % number of address lines to the coefficients Memory
q=n/2;                  % This is just an assumed value.
WordWidth=3*n-q;        % bits per word in Coefficients Memory.

[SUP_SIE MUP_SIE BUP_SIE tSIE] = HUandDelay(n,'SIE',k);
[SUP_mult MUP_mult BUP_mult tMult] = ...
    HUandDelay(ceil(n/2),'Mult18x18',WordWidth);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k,'MEM',WordWidth);
[SUP_add MUP_add BUP_add tAdd] = HUandDelay(n,'Adder',WordWidth);

HUP_SIE = HUP(SUP_SIE,MUP_SIE,BUP_SIE);
HUP_mult = HUP(SUP_mult,MUP_mult,BUP_mult);
HUP_mem = HUP(SUP_mem,MUP_mem,BUP_mem);
HUP_add = HUP(SUP_add,MUP_add,BUP_add);

device1 = [HUP_SIE tSIE];
device2 = [HUP_mem tMem];
device3 = [HUP_add tAdd];
device4 = [HUP_mult tMult];
device5 = [HUP_add tAdd];

dependency = [0 0 0 0 0
              1 0 0 0 0
              0 1 0 0 0
              0 1 1 0 0
              0 1 0 1 0];

components = [device1;device2;device3;device4;device5];
compNames = ['SIE        '
             'Memory     '
             'Adder 1    '
             'Multiplier '
             'Adder 2    '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components,dependency,compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components,dependency,compNames);
end


FILE: model_Linear_Uniform_Compact.m

function [totHUP totDelay] = model_Linear_Uniform_Compact(n,numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  model_Linear_Uniform_Compact.m
%
%  This function produces the HUP and delay for a compact model of a
%  linear NFG using uniform segmentation.
%
%  function [totHUP totDelay] = model_Linear_Uniform_Compact(n,numSegs)
%
%  Input:   n:         number of bits in the system
%           numSegs:   number of segments in the memory
%
%  Output:  totHUP:    hardware utilization percentage
%           totDelay:  total composite circuit delay
%
%  Comments:
%
%  Created by:  Tim Knudstrup
%  Date:        25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

k=ceil(log2(numSegs));  % number of address lines to the coefficients Memory
WordWidth=k+n;          % 2 n-bit numbers are stored in the Coefficients Memory

[SUP_mult MUP_mult BUP_mult tMult] = ...
    HUandDelay(ceil(n/2),'Mult18x18',WordWidth);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k,'MEM',WordWidth);
[SUP_add MUP_add BUP_add tAdd] = HUandDelay(n,'Adder',WordWidth);

HUP_mult = HUP(SUP_mult,MUP_mult,BUP_mult);
HUP_mem = HUP(SUP_mem,MUP_mem,BUP_mem);
HUP_add = HUP(SUP_add,MUP_add,BUP_add);

device1 = [HUP_mem tMem];
device2 = [HUP_mult tMult];
device3 = [HUP_add tAdd];

dependency = [0 0 0
              1 0 0
              1 1 0];

components = [device1;device2;device3];
compNames = ['Memory     '
             'Multiplier '
             'Adder      '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components,dependency,compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components,dependency,compNames);
end
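
The four linear models above differ mainly in how wide each coefficient-memory word is. The Python sketch below only restates the k and WordWidth assignments from those listings so the memory sizes can be compared side by side; the function name and the model labels are hypothetical, and the q = n/2 term is the value the compact nonuniform listing itself flags as assumed.

```python
import math


def coeff_memory_shape(n, num_segs, model):
    """Return (address lines k, word width, total bits) for a linear model.

    model is one of the hypothetical labels 'nonuniform_basic',
    'uniform_basic', 'nonuniform_compact', 'uniform_compact'.
    """
    k = math.ceil(math.log2(num_segs))   # address lines to the coefficients memory
    if model in ("nonuniform_basic", "uniform_basic"):
        word_width = 2 * n               # two n-bit coefficients per word
    elif model == "nonuniform_compact":
        word_width = 3 * n - n / 2       # WordWidth = 3*n - q with q = n/2 (assumed)
    elif model == "uniform_compact":
        word_width = k + n               # WordWidth = k + n
    else:
        raise ValueError("unknown model: " + model)
    word_width = math.ceil(word_width)   # HUandDelay.m applies ceil(WordWidth)
    return k, word_width, (2 ** k) * word_width
```

For n = 16 and 100 segments, for example, the uniform compact model stores 23-bit words against 32-bit words for the basic models, with the same 7 address lines.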


FILE: model_Quad_NonUniform_Basic.m

function [totHUP totDelay] = model_Quad_NonUniform_Basic(n, numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% model_Quad_NonUniform_Basic.m
%
% This function produces the HUP and delay for a model of a quadratic NFG
% using nonuniform segmentation.
%
% function [totHUP totDelay] = model_Quad_NonUniform_Basic(n, numSegs)
%
% Input:    n:          number of bits in the system
%           numSegs:    number of segments in the memory
%
% Output:   totHUP:     hardware utilization percentage
%           totalDelay: total composite circuit delay
%
% Comments:
%
% Created by: Tim Knudstrup
% Date:       25 September 2007

k = ceil(log2(numSegs));  % number of address lines to the coefficients Memory
WordWidth = 3*n;          % 3 n-bit numbers are stored in the Coefficients Memory

[SUP_SIE MUP_SIE BUP_SIE tSIE] = HUandDelay(n, 'SIE', k);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k, 'MEM', WordWidth);
[SUP_mult_2N MUP_mult_2N BUP_mult_2N tMult_2N] = ...
    HUandDelay(n, 'Mult18x18', WordWidth);
[SUP_mult_3N MUP_mult_3N BUP_mult_3N tMult_3N] = ...
    HUandDelay(ceil(1.5*n), 'Mult18x18', WordWidth);
[SUP_add_2N MUP_add_2N BUP_add_2N tAdd_2N] = HUandDelay(2*n, 'Adder', WordWidth);
[SUP_add_3N MUP_add_3N BUP_add_3N tAdd_3N] = HUandDelay(3*n, 'Adder', WordWidth);

HUP_SIE     = HUP(SUP_SIE, MUP_SIE, BUP_SIE);
HUP_mem     = HUP(SUP_mem, MUP_mem, BUP_mem);
HUP_mult_2N = HUP(SUP_mult_2N, MUP_mult_2N, BUP_mult_2N);
HUP_mult_3N = HUP(SUP_mult_3N, MUP_mult_3N, BUP_mult_3N);
HUP_add_2N  = HUP(SUP_add_2N, MUP_add_2N, BUP_add_2N);
HUP_add_3N  = HUP(SUP_add_3N, MUP_add_3N, BUP_add_3N);

device1 = [HUP_SIE tSIE];
device2 = [HUP_mem tMem];
device3 = [HUP_mult_2N tMult_2N];
device4 = [HUP_mult_2N tMult_2N];
device5 = [HUP_mult_3N tMult_3N];
device6 = [HUP_add_2N tAdd_2N];
device7 = [HUP_add_3N tAdd_3N];

dependency = [0 0 0 0 0 0 0
              1 0 0 0 0 0 0
              0 0 0 0 0 0 0
              0 1 0 0 0 0 0
              0 1 1 0 0 0 0
              0 1 0 1 0 0 0
              0 0 0 0 1 1 0];

components = [device1; device2; device3; device4; device5; device6; device7];
compNames = ['SIE          '
             'Coeff. Table '
             'Multiplier 1 '
             'Multiplier 2 '
             'Multiplier 3 '
             'Adder 1      '
             'Adder 2      '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components, dependency, compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components, dependency, compNames);
end


FILE: model_Quad_Uniform_Basic.m

function [totHUP totDelay] = model_Quad_Uniform_Basic(n, numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% model_Quad_Uniform_Basic.m
%
% This function produces the HUP and delay for a basic model of a
% quadratic NFG using uniform segmentation.
%
% function [totHUP totDelay] = model_Quad_Uniform_Basic(n, numSegs)
%
% Input:    n:          number of bits in the system
%           numSegs:    number of segments in the memory
%
% Output:   totHUP:     hardware utilization percentage
%           totalDelay: total composite circuit delay
%
% Comments:
%
% Created by: Tim Knudstrup
% Date:       25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

k = ceil(log2(numSegs));  % number of address lines to the coefficients Memory
WordWidth = 3*n;          % 3 n-bit numbers are stored in the Coefficients Memory

%[SUP_SIE MUP_SIE BUP_SIE tSIE] = HUandDelay(n, 'SIE', k);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k, 'MEM', WordWidth);
[SUP_mult_2N MUP_mult_2N BUP_mult_2N tMult_2N] = ...
    HUandDelay(n, 'Mult18x18', WordWidth);
[SUP_mult_3N MUP_mult_3N BUP_mult_3N tMult_3N] = ...
    HUandDelay(ceil(1.5*n), 'Mult18x18', WordWidth);
[SUP_add_2N MUP_add_2N BUP_add_2N tAdd_2N] = HUandDelay(2*n, 'Adder', WordWidth);
[SUP_add_3N MUP_add_3N BUP_add_3N tAdd_3N] = HUandDelay(3*n, 'Adder', WordWidth);

%HUP_SIE = HUP(SUP_SIE, MUP_SIE, BUP_SIE);
HUP_mem     = HUP(SUP_mem, MUP_mem, BUP_mem);
HUP_mult_2N = HUP(SUP_mult_2N, MUP_mult_2N, BUP_mult_2N);
HUP_mult_3N = HUP(SUP_mult_3N, MUP_mult_3N, BUP_mult_3N);
HUP_add_2N  = HUP(SUP_add_2N, MUP_add_2N, BUP_add_2N);
HUP_add_3N  = HUP(SUP_add_3N, MUP_add_3N, BUP_add_3N);

%device1 = [HUP_SIE tSIE];
device1 = [HUP_mem tMem];
device2 = [HUP_mult_2N tMult_2N];
device3 = [HUP_mult_2N tMult_2N];
device4 = [HUP_mult_3N tMult_3N];
device5 = [HUP_add_2N tAdd_2N];
device6 = [HUP_add_3N tAdd_3N];

dependency = [0 0 0 0 0 0
              0 0 0 0 0 0
              1 0 0 0 0 0
              1 1 0 0 0 0
              1 0 1 0 0 0
              0 0 0 1 1 0];

components = [device1; device2; device3; device4; device5; device6];
compNames = ['Coeff. Table '
             'Multiplier 1 '
             'Multiplier 2 '
             'Multiplier 3 '
             'Adder 1      '
             'Adder 2      '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components, dependency, compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components, dependency, compNames);
end


FILE: model_Quad_NonUniform_Compact.m

function [totHUP totDelay] = model_Quad_NonUniform_Compact(n, numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% model_Quad_NonUniform_Compact.m
%
% This function produces the HUP and delay for a model of a compact
% quadratic NFG using nonuniform segmentation.
%
% function [totHUP totDelay] = model_Quad_NonUniform_Compact(n, numSegs)
%
% Input:    n:          number of bits in the system
%           numSegs:    number of segments in the memory
%
% Output:   totHUP:     hardware utilization percentage
%           totalDelay: total composite circuit delay
%
% Comments:
%
% Created by: Tim Knudstrup
% Date:       25 September 2007

k = ceil(log2(numSegs));    % number of address lines to the coefficients Memory
q1 = n/2;                   % these are just example q's
q2 = n/2;
WordWidth = 4*n - q1 - q2;  % Coefficients Memory

[SUP_SIE MUP_SIE BUP_SIE tSIE] = HUandDelay(n, 'SIE', k);
[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k, 'MEM', WordWidth);
[SUP_mult_q MUP_mult_q BUP_mult_q tMult_q] = ...
    HUandDelay(ceil(q2/2), 'Mult18x18', WordWidth);
[SUP_mult_N MUP_mult_N BUP_mult_N tMult_N] = ...
    HUandDelay(ceil(n/2), 'Mult18x18', WordWidth);
[SUP_add MUP_add BUP_add tAdd] = HUandDelay(n, 'Adder', WordWidth);

HUP_mem    = HUP(SUP_mem, MUP_mem, BUP_mem);
HUP_SIE    = HUP(SUP_SIE, MUP_SIE, BUP_SIE);
HUP_mult_q = HUP(SUP_mult_q, MUP_mult_q, BUP_mult_q);
HUP_mult_N = HUP(SUP_mult_N, MUP_mult_N, BUP_mult_N);
HUP_add    = HUP(SUP_add, MUP_add, BUP_add);

device1 = [HUP_SIE tSIE];
device2 = [HUP_mem tMem];
device3 = [HUP_add tAdd];
device4 = [HUP_mult_q tMult_q];
device5 = [HUP_mult_N tMult_N];
device6 = [HUP_mult_N tMult_N];
device7 = [HUP_add tAdd];
device8 = [HUP_add tAdd];

dependency = [0 0 0 0 0 0 0 0
              1 0 0 0 0 0 0 0
              0 1 0 0 0 0 0 0
              0 0 1 0 0 0 0 0
              0 1 1 0 0 0 0 0
              0 1 0 1 0 0 0 0
              0 1 0 0 1 0 0 0
              0 0 0 0 0 1 1 0];

components = [device1; device2; device3; device4; device5; device6; device7; ...
              device8];
compNames = ['SIE          '
             'Coeff. Table '
             'Adder 1      '
             'Multiplier 1 '
             'Multiplier 2 '
             'Multiplier 3 '
             'Adder 2      '
             'Adder 3      '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components, dependency, compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components, dependency, compNames);
end


FILE: model_Quad_Uniform_Compact.m

function [totHUP totDelay] = model_Quad_Uniform_Compact(n, numSegs)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% model_Quad_Uniform_Compact.m
%
% This function produces the HUP and delay for a model of a
% compact quadratic NFG using uniform segmentation.
%
% function [totHUP totDelay] = model_Quad_Uniform_Compact(n, numSegs)
%
% Input:    n:          number of bits in the system
%           numSegs:    number of segments in the memory
%
% Output:   totHUP:     hardware utilization percentage
%           totalDelay: total composite circuit delay
%
% Comments:
%
% Created by: Tim Knudstrup
% Date:       25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

k = ceil(log2(numSegs));    % number of address lines to the coefficients Memory
q1 = n/2;                   % these are just example q's
q2 = n/2;
WordWidth = 4*n - q1 - q2;  % Coefficients Memory

[SUP_mem MUP_mem BUP_mem tMem] = HUandDelay(k, 'MEM', WordWidth);
[SUP_mult_q MUP_mult_q BUP_mult_q tMult_q] = ...
    HUandDelay(ceil(q2/2), 'Mult18x18', WordWidth);
[SUP_mult_N MUP_mult_N BUP_mult_N tMult_N] = ...
    HUandDelay(ceil(n/2), 'Mult18x18', WordWidth);
[SUP_add MUP_add BUP_add tAdd] = HUandDelay(n, 'Adder', WordWidth);

HUP_mem    = HUP(SUP_mem, MUP_mem, BUP_mem);
HUP_mult_q = HUP(SUP_mult_q, MUP_mult_q, BUP_mult_q);
HUP_mult_N = HUP(SUP_mult_N, MUP_mult_N, BUP_mult_N);
HUP_add    = HUP(SUP_add, MUP_add, BUP_add);

device1 = [HUP_mem tMem];
device2 = [HUP_add tAdd];
device3 = [HUP_mult_q tMult_q];
device4 = [HUP_mult_N tMult_N];
device5 = [HUP_mult_N tMult_N];
device6 = [HUP_add tAdd];
device7 = [HUP_add tAdd];

dependency = [0 0 0 0 0 0 0
              1 0 0 0 0 0 0
              0 1 0 0 0 0 0
              1 1 0 0 0 0 0
              1 0 1 0 0 0 0
              1 0 0 1 0 0 0
              0 0 0 0 1 1 0];

components = [device1; device2; device3; device4; device5; device6; device7];
compNames = ['Coeff. Table '
             'Adder 1      '
             'Multiplier 1 '
             'Multiplier 2 '
             'Multiplier 3 '
             'Adder 2      '
             'Adder 3      '];

graphON = 0;
if graphON == 1
    [totHUP totDelay] = HUPBoxes(components, dependency, compNames);
else
    [totHUP totDelay] = totalHUPandDelay(components, dependency, compNames);
end


FILE: myInt.m

function [intVal] = myInt(f_symbol, a, b)
% This function returns an approximation for the integral of the symbolic
% function over the interval a to b. The approximation is calculated
% using trapezoidal integration approximation.

numPts = 10000;
rez = (b-a)/numPts;
X = [a:rez:b];
y = (subs(f_symbol, X));

totSum = 0;
width = X(2) - X(1);
for ii = 1:length(X)-1
    incSum = width*y(ii) + 0.5*width*(y(ii+1) - y(ii));
    totSum = totSum + incSum;
end
intVal = totSum;
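
The trapezoidal sum accumulated in the loop of myInt can be checked numerically without the Symbolic Math Toolbox. The following is a minimal sketch only: the integrand (x.^2), the limits, and the variable names f and vecSum are illustrative assumptions and are not part of the thesis code. It compares the loop form with MATLAB's built-in trapz.

```matlab
% Illustrative check of the trapezoidal rule used by myInt.
% Assumption: f is an ordinary vectorized function handle,
% not a symbolic expression as in the thesis code.
f = @(x) x.^2;          % hypothetical integrand
a = 0;                  % hypothetical lower limit
b = 1;                  % hypothetical upper limit
numPts = 10000;         % same point count as myInt

X = linspace(a, b, numPts + 1);
y = f(X);
width = X(2) - X(1);

% Loop form, as in myInt: a rectangle plus a triangle per interval
totSum = 0;
for ii = 1:length(X)-1
    totSum = totSum + width*y(ii) + 0.5*width*(y(ii+1) - y(ii));
end

% Equivalent vectorized form using the built-in trapz
vecSum = trapz(X, y);

fprintf('loop = %.8f, trapz = %.8f, exact = %.8f\n', totSum, vecSum, 1/3);
```

For this integrand both sums should agree with the exact value 1/3 to well within 1e-6, since the trapezoidal error shrinks with the square of the step width.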


FILE: pickModel.m

function [totHUP totDelay] = pickModel(modelNum, n, segs)
% This function returns the total HUP and Delay for a function
% implemented on an NFG model chosen by 'modelNum.'
% The default model is the basic linear NFG with uniform segmentation
% (LUB).

switch modelNum
    case 1
        [totHUP totDelay] = model_Linear_Uniform_Basic(n, segs);
    case 2
        [totHUP totDelay] = model_Linear_NonUniform_Basic(n, segs);
    case 5
        [totHUP totDelay] = model_Linear_Uniform_Compact(n, segs);
    case 6
        [totHUP totDelay] = model_Linear_NonUniform_Compact(n, segs);
    case 3
        [totHUP totDelay] = model_Quad_Uniform_Basic(n, segs);
    case 4
        [totHUP totDelay] = model_Quad_NonUniform_Basic(n, segs);
    case 7
        [totHUP totDelay] = model_Quad_Uniform_Compact(n, segs);
    case 8
        [totHUP totDelay] = model_Quad_NonUniform_Compact(n, segs);
    otherwise
        [totHUP totDelay] = model_Linear_Uniform_Basic(n, segs);
end


FILE: segments.m

function [numSegs] = segments(f, xmin, xmax, n)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% segments.m
%
% This function returns the number of required segments for LU, LN, QU,
% QN NFGs for a given function (f) on an interval [xmin, xmax] for a
% system with n bits.
%
% function [numSegs] = segments(f, xmin, xmax, n)
%
% Input:    f:          string value of a function of x
%           xmin, xmax: NFG domain
%           n:          number of system bits, precision
%
% Output:   numSegs:    4 by 1 vector returning the number of
%                       segments for [LU; LN; QU; QN] NFGs
%
% Comments:
%
% Created by: Tim Knudstrup
% Date:       20 September 2007

clear numSegs numSegsLin_NONUNIFORM numSegsLin_UNIFORM;
clear numSegsQuad_NONUNIFORM numSegsQuad_UNIFORM;
clear SegsLin SegsQuad;
func = inline(f);

syms 'x' % 'epps' % 'a' 'b'

a = xmin;
b = xmax;
epps = 2^(-n-1);
f_of_x = func(x);
FirstDeriv = diff(f_of_x, 'x');
SecondDeriv = diff(FirstDeriv, 'x');
sqrt_2ndDeriv = sqrt((SecondDeriv));

%SegsLin = abs(0.25*int((sqrt_2ndDeriv),'x',a,b)/sqrt(epps))
numSegsLin_NONUNIFORM = ceil(0.25*myInt(abs(sqrt_2ndDeriv),a,b)/sqrt(epps));
thirdDeriv = diff(SecondDeriv, 'x');

%SegsQuad = abs(0.25 * int(((thirdDeriv))^(1/3),a,b)/(3*epps)^(1/3))
numSegsQuad_NONUNIFORM = ...
    ceil(0.25*myInt(abs(thirdDeriv)^(1/3),a,b)/(3*epps)^(1/3));

% Substituting values
a = xmin;
b = xmax;
epps = 2^(-n-1);
%numSegsLin_NONUNIFORM=ceil(abs(subs(SegsLin)));
%numSegsQuad_NONUNIFORM=ceil(abs(subs(SegsQuad)));

dummyX = [a:(b-a)/100:b]';
max_2ndDeriv = max(abs((subs(SecondDeriv, dummyX))));
segWidth_Linear = 4*sqrt(epps/max_2ndDeriv);
max_3rdDeriv = max(abs(subs(thirdDeriv, dummyX)));
segWidth_Quad = 4*(3*epps/max_3rdDeriv)^(1/3);

numSegsLin_UNIFORM = ceil((b-a)/segWidth_Linear);
numSegsQuad_UNIFORM = ceil((b-a)/segWidth_Quad);

numSegs = [numSegsLin_UNIFORM;
           numSegsLin_NONUNIFORM;
           numSegsQuad_UNIFORM;
           numSegsQuad_NONUNIFORM];
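
For a function whose derivatives are known in closed form, the uniform segment counts can be reproduced by hand from the two segment-width formulas in segments.m. The sketch below is illustrative only: the choice f(x) = x^3 on [0, 1] with n = 8, and the handles d2 and d3 that stand in for the symbolic derivatives, are assumptions made for this example.

```matlab
% Illustrative uniform segment counts for f(x) = x^3 on [0, 1], n = 8.
% Assumption: derivatives are supplied analytically instead of via syms/diff.
n = 8;
a = 0;
b = 1;
epps = 2^(-n-1);                    % error tolerance, as in segments.m

d2 = @(x) 6*x;                      % second derivative of x^3
d3 = @(x) 6*ones(size(x));          % third derivative of x^3

dummyX = (a:(b-a)/100:b)';          % same sampling grid as segments.m
max_2ndDeriv = max(abs(d2(dummyX)));
max_3rdDeriv = max(abs(d3(dummyX)));

% Widest uniform segment that keeps the approximation error within epps
segWidth_Linear = 4*sqrt(epps/max_2ndDeriv);
segWidth_Quad   = 4*(3*epps/max_3rdDeriv)^(1/3);

numSegsLin_UNIFORM  = ceil((b-a)/segWidth_Linear);
numSegsQuad_UNIFORM = ceil((b-a)/segWidth_Quad);

fprintf('linear: %d segments, quadratic: %d segments\n', ...
        numSegsLin_UNIFORM, numSegsQuad_UNIFORM);
```

Under these assumptions the sketch reports 14 segments for the linear NFG and 3 for the quadratic NFG, which illustrates how the cube-root dependence on the error tolerance reduces the segment count.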


FILE: totalHUPandDelay.m

function [totHUP totalDelay] = totalHUPandDelay(components, dependence, compNames)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% depBoxes.m
%
% This function/program calculates the delay and percent hardware
% utilization given up to 12 components and a dependence relationship.
% It is used to calculate circuit components in series and in parallel
% and the combined delay of multiple components which is dependent on
% one component's relationship to another.
%
% This function was modified from HUPboxes, which plots the outputs
%
% function [totHUP totalDelay] =
%     totalHUPandDelay(components, dependence, compNames)
%
% Input:    components: nx2 array of components arranged
%                       n = row number = the component number
%                       Max number of ROWs is 12
%                       each row contains:
%                       [ HUP timedelay ]
%
%           dependence: an nxn array that defines the dependence
%                       of the components.
%                       For each row, the array should contain a 1 if
%                       the component number (row#) has to wait until
%                       another component is completed (in series).
%
%           compNames:  an nx1 column of strings, naming each component
%                       strings must be the same length, can add extra
%                       spaces.
%
% Output:   totHUP:     hardware utilization percentage
%           totalDelay: total composite circuit delay
%
% Comments:
%
% Created by: Tim Knudstrup
% Date:       25 September 2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

numComps = size(components);
numComps = numComps(1);

% Color list (each Row contains a different color code (upto 12))
Clist = [0.5   0     0
         0     0     0.5
         0     0.5   0
         0.5   0.5   0
         0.5   0     0.5
         0     0.5   0.5
         0.75  0     0
         0     0     0.75
         0     0.75  0
         0.75  0.75  0
         0.75  0     0.75
         0     0.75  0.75];

compEnds = zeros(1, numComps);
compStarts = compEnds;
compTop = compEnds;
compBot = compEnds;

for comp = 1:numComps
    if (sum(dependence(comp,:)) == 0)
        compStarts(comp) = 0;
    else
        compDep = find(dependence(comp,:));
        compStarts(comp) = max(compEnds(compDep));
    end
    compEnds(comp) = compStarts(comp) + components(comp,2);
end
compStarts;
compEnds;

for comp = 1:numComps
    if (comp == 1)
        compBot(comp) = 0;
    else
        sameStart = find(compStarts(1:comp-1) == compStarts(comp));
        if isempty(sameStart)
            compDep = find(dependence(comp,:));
            [y indx] = max(compEnds(compDep));  % finds index into
            compBot(comp) = compBot(indx);
        else
            largestTop = max(sameStart);
            compBot(comp) = compTop(largestTop);
        end
    end
    compTop(comp) = compBot(comp) + components(comp,1);
end
compBot;
compTop;

% OUTPUT Data
totalDelay = max(compEnds);
totHUP = sum(components(:,1));
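
The way the dependence matrix turns individual component delays into a composite delay can be seen on a small case. The sketch below is a simplified illustration, not the thesis routine: the HUP and delay figures are made-up numbers, and only the start/end-time loop and the two outputs are reproduced (the plotting-related compTop and compBot bookkeeping is omitted).

```matlab
% Illustrative delay/HUP roll-up for a three-component chain.
% Assumption: the HUP and delay numbers below are invented for demonstration.
% Rows: memory, multiplier, adder; columns: [HUP delay]
components = [10 2
              25 5
               5 3];
% Row i has a 1 in column j when component i must wait for component j:
% the multiplier waits for the memory; the adder waits for both.
dependence = [0 0 0
              1 0 0
              1 1 0];

numComps = size(components, 1);
compStarts = zeros(1, numComps);
compEnds   = zeros(1, numComps);
for comp = 1:numComps
    compDep = find(dependence(comp, :));
    if ~isempty(compDep)
        % a component starts when its slowest prerequisite has finished
        compStarts(comp) = max(compEnds(compDep));
    end
    compEnds(comp) = compStarts(comp) + components(comp, 2);
end

totalDelay = max(compEnds);          % critical path: 2 + 5 + 3
totHUP = sum(components(:, 1));      % utilization simply adds: 10 + 25 + 5

fprintf('totalDelay = %d, totHUP = %d\n', totalDelay, totHUP);
```

Because each row only references earlier rows, a single forward pass suffices; this matches the ordering used in the model files, where every component is listed after the components it depends on.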
APPENDIX B. DATA COLLECTION


B.1 DATA COLLECTION WITH XILINX ISE PROJECT NAVIGATOR


Xilinx ISE Project Navigator was used extensively to construct schematic and
behavioral sources in order to estimate hardware utilization and delay.

1. HDL Sources

Behavioral VHDL sources were written in Xilinx ISE Project Navigator for
multipliers and adders. Some circuits were constructed from schematics using
Xilinx's primitive hardware. These circuits produced Verilog code during the
synthesis process. The .vf files for the schematic circuits are also shown in
this appendix.

The VHDL sources were changed during the data collection phase of this
thesis in order to collect information on circuits of various sizes. For
example, the number of input and output bits of the behavioral adder was
varied between 1 and 129. The VHDL code shown in this appendix is the most
recently used version of each file.


FILE: Adder_64.vhd

-- Company:        NPS
-- Engineer:       Tim Knudstrup
--
-- Create Date:    08/2/07
-- Design Name:
-- Module Name:    adder_64bit - Behavioral
-- Project Name:
-- Target Device:
-- Tool versions:
-- Description:
--
-- Dependencies:
--
-- Revision:
-- Revision 0.01 - File Created
-- Additional Comments:
--
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

---- Uncomment the following library declaration if instantiating
---- any Xilinx primitives in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

entity Adder_64 is
    Port ( a   : in  std_logic_vector(128 downto 0);
           b   : in  std_logic_vector(128 downto 0);
           sum : out std_logic_vector(128 downto 0));
end Adder_64;

architecture Behavioral of Adder_64 is
begin
    sum <= a+b;
end Behavioral;


FILE: Multiplier.vhd

-- Company:        NPS
-- Engineer:       Tim Knudstrup
--
-- Create Date:    08/2/07
-- Design Name:
-- Module Name:    Multiplier - Behavioral
-- Project Name:
-- Target Device:
-- Tool versions:
-- Description:
--
-- Dependencies:
--
-- Revision:
-- Revision 0.01 - File Created
-- Additional Comments:
--
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

---- Uncomment the following library declaration if instantiating
---- any Xilinx primitives in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

entity Multiplier is
    Port ( a   : in  std_logic_vector(16 downto 0);
           b   : in  std_logic_vector(16 downto 0);
           sum : out std_logic_vector(33 downto 0));
end Multiplier;

architecture Behavioral of Multiplier is
begin
    sum <= a*b;
end Behavioral;


FILE: mux128to1.vf

////////////////////////////////////////////////////////////////////////////////
// Copyright (c) 1995-2007 Xilinx, Inc.  All rights reserved.
////////////////////////////////////////////////////////////////////////////////
//   ____  ____
//  /   /\/   /
// /___/  \  /    Vendor: Xilinx
// \   \   \/     Version : 9.2.02i
//  \   \         Application : sch2verilog
//  /   /         Filename : mux128to1.vf
// /___/   /\     Timestamp : 11/11/2007 12:03:00
// \   \  /  \
//  \___\/\___\
//
//Command: C:\Xilinx92i\bin\nt\sch2verilog.exe -intstyle ise -family virtex2
//  "C:/Documents and Settings/HP Owner/My
//  Documents/schoolStuff/Thesis/VHDL/ThesisVHDLSims/mux128to1.sch" mux128to1.vf
//Design Name: mux128to1
//Device: virtex2
//Purpose:
//    This verilog netlist is translated from an ECS schematic. It can be
//    synthesized and simulated, but it should not be modified.
//
`timescale 1ns / 1ps

module M2_1E_MXILINX_mux128to1(D0, D1, E, S0, O);

    input D0;
    input D1;
    input E;
    input S0;
   output O;

   wire M0;
   wire M1;

   AND3 I_36_30 (.I0(D1), .I1(E), .I2(S0), .O(M1));
   AND3B1 I_36_31 (.I0(S0), .I1(E), .I2(D0), .O(M0));
   OR2 I_36_38 (.I0(M1), .I1(M0), .O(O));
endmodule
`timescale 1ns / 1ps

module M4_1E_MXILINX_mux128to1(D0, D1, D2, D3, E, S0, S1, O);

    input D0;
    input D1;
    input D2;
    input D3;
    input E;
    input S0;
    input S1;
   output O;

   wire M01;
   wire M23;

   M2_1E_MXILINX_mux128to1 I_M01 (.D0(D0), .D1(D1), .E(E), .S0(S0), .O(M01));
   // synthesis attribute HU_SET of I_M01 is "I_M01_1"
   M2_1E_MXILINX_mux128to1 I_M23 (.D0(D2), .D1(D3), .E(E), .S0(S0), .O(M23));
   // synthesis attribute HU_SET of I_M23 is "I_M23_0"
   MUXF5 I_O (.I0(M01), .I1(M23), .S(S1), .O(O));
endmodule
`timescale 1ns / 1ps

module mux128to1(DataIn, Sel, XLXN_9, XLXN_20);

    input [127:0] DataIn;
    input [6:0] Sel;
    input XLXN_9;
   output XLXN_20;

   wire XLXN_1;
   wire XLXN_2;
   wire XLXN_3;
   wire XLXN_4;

   mux32to1 XLXI_2 (.CE(XLXN_9),
                    .dataIn(DataIn[127:96]),
                    .Sel(Sel[4:0]),
                    .XLXN_125(XLXN_1));
   mux32to1 XLXI_3 (.CE(XLXN_9),
                    .dataIn(DataIn[95:64]),
                    .Sel(Sel[4:0]),
                    .XLXN_125(XLXN_2));
   mux32to1 XLXI_4 (.CE(XLXN_9),
                    .dataIn(DataIn[63:32]),
                    .Sel(Sel[4:0]),
                    .XLXN_125(XLXN_3));
   mux32to1 XLXI_5 (.CE(XLXN_9),
                    .dataIn(DataIn[31:0]),
                    .Sel(Sel[4:0]),
                    .XLXN_125(XLXN_4));
   M4_1E_MXILINX_mux128to1 XLXI_6 (.D0(XLXN_1),
                                   .D1(XLXN_2),
                                   .D2(XLXN_3),
                                   .D3(XLXN_4),
                                   .E(XLXN_9),
                                   .S0(Sel[5]),
                                   .S1(Sel[6]),
                                   .O(XLXN_20));
   // synthesis attribute HU_SET of XLXI_6 is "XLXI_6_2"
endmodule


FILE: fanouts.vf

////////////////////////////////////////////////////////////////////////////////
// Copyright (c) 1995-2007 Xilinx, Inc.  All rights reserved.
////////////////////////////////////////////////////////////////////////////////
//   ____  ____
//  /   /\/   /
// /___/  \  /    Vendor: Xilinx
// \   \   \/     Version : 9.2.02i
//  \   \         Application : sch2verilog
//  /   /         Filename : fanouts.vf
// /___/   /\     Timestamp : 11/11/2007 12:03:12
// \   \  /  \
//  \___\/\___\
//
//Command: C:\Xilinx92i\bin\nt\sch2verilog.exe -intstyle ise -family virtex2 -w
//  "C:/Documents and Settings/HP Owner/My
//  Documents/schoolStuff/Thesis/VHDL/ThesisVHDLSims/fanouts.sch" fanouts.vf
//Design Name: fanouts
//Device: virtex2
//Purpose:
//    This verilog netlist is translated from an ECS schematic. It can be
//    synthesized and simulated, but it should not be modified.
//
`timescale 1ns / 1ps

module AND12_MXILINX_fanouts(I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, O);

    input I0;
    input I1;
    input I2;
    input I3;
    input I4;
    input I5;
    input I6;
    input I7;
    input I8;
    input I9;
    input I10;
    input I11;
   output O;

   wire dummy;
   wire S0;
   wire S1;
   wire S2;
   wire O_DUMMY;

   assign O = O_DUMMY;
   FMAP I_36_29 (.I1(I0), .I2(I1), .I3(I2), .I4(I3), .O(S0));
   // synthesis attribute RLOC of I_36_29 is "X0Y0"
   AND4 I_36_110 (.I0(I0), .I1(I1), .I2(I2), .I3(I3), .O(S0));
   AND4 I_36_127 (.I0(I4), .I1(I5), .I2(I6), .I3(I7), .O(S1));
   FMAP I_36_138 (.I1(I4), .I2(I5), .I3(I6), .I4(I7), .O(S1));
   // synthesis attribute RLOC of I_36_138 is
   FMAP I_36_142 (.I1(I8), .I2(I9), .I3(I10), .I4(I11), .O(S2));
   // synthesis attribute RLOC of I_36_142 is
   AND4 I_36_151 (.I0(I8), .I1(I9), .I2(I10), .I3(I11), .O(S2));
   AND3 I_36_177 (.I0(S0), .I1(S1), .I2(S2), .O(O_DUMMY));
   FMAP I_36_181 (.I1(S0), .I2(S1), .I3(S2), .I4(dummy), .O(O_DUMMY));
   // synthesis attribute RLOC of I_36_181 is
endmodule
`timescale 1ns / 1ps


module AND16_MXILINX_fanouts(I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10,
                             I11, I12, I13, I14, I15, O);

    input I0;
    input I1;
    input I2;
    input I3;
    input I4;
    input I5;
    input I6;
    input I7;
    input I8;
    input I9;
    input I10;
    input I11;
    input I12;
    input I13;
    input I14;
    input I15;
   output O;

   wire CIN;
   wire C0;
   wire C1;
   wire C2;
   wire S0;
   wire S1;
   wire S2;
   wire S3;
   wire XLXN_46;

   MUXCY_L I_36_2 (.CI(CIN), .DI(XLXN_46), .S(S0), .LO(C0));
   // synthesis attribute RLOC of I_36_2 is
   FMAP I_36_29 (.I1(I0), .I2(I1), .I3(I2), .I4(I3), .O(S0));
   // synthesis attribute RLOC of I_36_29 is
   VCC I_36_107 (.P(CIN));
   GND I_36_109 (.G(XLXN_46));
   AND4 I_36_110 (.I0(I0), .I1(I1), .I2(I2), .I3(I3), .O(S0));
   AND4 I_36_127 (.I0(I4), .I1(I5), .I2(I6), .I3(I7), .O(S1));
   MUXCY_L I_36_129 (.CI(C0), .DI(XLXN_46), .S(S1), .LO(C1));
   // synthesis attribute RLOC of I_36_129 is
   FMAP I_36_138 (.I1(I4), .I2(I5), .I3(I6), .I4(I7), .O(S1));
   // synthesis attribute RLOC of I_36_138 is
   FMAP I_36_142 (.I1(I8), .I2(I9), .I3(I10), .I4(I11), .O(S2));
   // synthesis attribute RLOC of I_36_142 is
   MUXCY_L I_36_147 (.CI(C1), .DI(XLXN_46), .S(S2), .LO(C2));
   // synthesis attribute RLOC of I_36_147 is
   AND4 I_36_151 (.I0(I8), .I1(I9), .I2(I10), .I3(I11), .O(S2));
   AND4 I_36_161 (.I0(I12), .I1(I13), .I2(I14), .I3(I15), .O(S3));
   MUXCY I_36_165 (.CI(C2), .DI(XLXN_46), .S(S3), .O(O));
   // synthesis attribute RLOC of I_36_165 is
   FMAP I_36_170 (.I1(I12), .I2(I13), .I3(I14), .I4(I15), .O(S3));
   // synthesis attribute RLOC of I_36_170 is
endmodule

`timescale 1ns / 1ps

module AND9_MXILINX_fanouts(I0, I1, I2, I3, I4, I5, I6, I7, I8, O);

    input I0;
    input I1;
    input I2;
    input I3;
    input I4;
    input I5;
    input I6;
    input I7;
    input I8;
   output O;

   wire dummy;
   wire S0;
   wire S1;
   wire O_DUMMY;

   assign O = O_DUMMY;
   FMAP I_36_29 (.I1(I0), .I2(I1), .I3(I2), .I4(I3), .O(S0));
   // synthesis attribute RLOC of I_36_29 is
   AND4 I_36_110 (.I0(I0), .I1(I1), .I2(I2), .I3(I3), .O(S0));
   AND4 I_36_127 (.I0(I4), .I1(I5), .I2(I6), .I3(I7), .O(S1));
   FMAP I_36_138 (.I1(I4), .I2(I5), .I3(I6), .I4(I7), .O(S1));
   // synthesis attribute RLOC of I_36_138 is
   FMAP I_36_142 (.I1(S0), .I2(S1), .I3(I8), .I4(dummy), .O(O_DUMMY));
   // synthesis attribute RLOC of I_36_142 is
   AND3 I_36_176 (.I0(S0), .I1(S1), .I2(I8), .O(O_DUMMY));
endmodule

`timescale 1ns / 1ps

module AND8_MXILINX_fanouts(I0, I1, I2, I3, I4, I5, I6, I7, O);

    input I0;
    input I1;
    input I2;
    input I3;
    input I4;
    input I5;
    input I6;
    input I7;
   output O;

   wire dummy;
   wire S0;
   wire S1;
   wire O_DUMMY;

   assign O = O_DUMMY;
   FMAP I_36_29 (.I1(I0), .I2(I1), .I3(I2), .I4(I3), .O(S0));
   // synthesis attribute RLOC of I_36_29 is
   AND4 I_36_110 (.I0(I0), .I1(I1), .I2(I2), .I3(I3), .O(S0));
   AND4 I_36_127 (.I0(I4), .I1(I5), .I2(I6), .I3(I7), .O(S1));
   FMAP I_36_138 (.I1(I4), .I2(I5), .I3(I6), .I4(I7), .O(S1));
   // synthesis attribute RLOC of I_36_138 is
   AND2 I_36_142 (.I0(S0), .I1(S1), .O(O_DUMMY));
   FMAP I_36_152 (.I1(S0), .I2(S1), .I3(dummy), .I4(dummy), .O(O_DUMMY));
   // synthesis attribute RLOC of I_36_152 is
endmodule
`timescale 1ns / 1ps


module AND7_MXILINX_fanouts(I0, I1, I2, I3, I4, I5, I6, O);

    input I0;
    input I1;
    input I2;
    input I3;
    input I4;
    input I5;
    input I6;
   output O;

   wire I36;
   wire O_DUMMY;

   assign O = O_DUMMY;
   AND4 I_36_69 (.I0(I3), .I1(I4), .I2(I5), .I3(I6), .O(I36));
   AND4 I_36_85 (.I0(I0), .I1(I1), .I2(I2), .I3(I36), .O(O_DUMMY));
   FMAP I_36_98 (.I1(I0), .I2(I1), .I3(I2), .I4(I36), .O(O_DUMMY));
   // synthesis attribute RLOC of I_36_98 is
   FMAP I_36_110 (.I1(I3), .I2(I4), .I3(I5), .I4(I6), .O(I36));
   // synthesis attribute RLOC of I_36_110 is
endmodule

`timescale 1ns / 1ps

module AND6_MXILINX_fanouts(I0, I1, I2, I3, I4, I5, O);

    input I0;
    input I1;
    input I2;
    input I3;
    input I4;
    input I5;
   output O;

   wire dummy;
   wire I35;
   wire O_DUMMY;

   assign O = O_DUMMY;
   AND3 I_36_69 (.I0(I3), .I1(I4), .I2(I5), .O(I35));
   AND4 I_36_85 (.I0(I0), .I1(I1), .I2(I2), .I3(I35), .O(O_DUMMY));
   FMAP I_36_93 (.I1(I3), .I2(I4), .I3(I5), .I4(dummy), .O(I35));
   // synthesis attribute RLOC of I_36_93 is
   FMAP I_36_94 (.I1(I0), .I2(I1), .I3(I2), .I4(I35), .O(O_DUMMY));
   // synthesis attribute RLOC of I_36_94 is
endmodule

`timescale 1ns / 1ps

module fanouts(XLXN_115,
               XLXN_520,
               XLXN_537,
               XLXN_118,
               XLXN_144,
               XLXN_483,
               XLXN_484,
               XLXN_485,
               XLXN_486,
               XLXN_487,
               XLXN_521,
               XLXN_522,
               XLXN_523,
               XLXN_524,
               XLXN_525,
               XLXN_603);

   input XLXN_115;
   input XLXN_520;
   input XLXN_537;
   output XLXN_118;
   output XLXN_144;
   output XLXN_483;
   output XLXN_484;
   output XLXN_485;
   output XLXN_486;
   output XLXN_487;
   output XLXN_521;
   output XLXN_522;
   output XLXN_523;
   output XLXN_524;
   output XLXN_525;
   output XLXN_603;

   wire XLXN_11;
   wire XLXN_112;
   wire XLXN_152;
   wire XLXN_212;
   wire XLXN_249;
   wire XLXN_258;
   wire XLXN_503;
   wire XLXN_506;
   wire XLXN_509;
   wire XLXN_517;

   AND2 XLXI_3 (.I0(XLXN_115), .I1(XLXN_115), .O(XLXN_112));
   AND3 XLXI_4 (.I0(XLXN_112), .I1(XLXN_112), .I2(XLXN_112), .O(XLXN_152));
   AND4 XLXI_5 (.I0(XLXN_152), .I1(XLXN_152), .I2(XLXN_152), .I3(XLXN_152),
                .O(XLXN_11));
   AND6_MXILINX_fanouts XLXI_7 (.I0(XLXN_11), .I1(XLXN_11), .I2(XLXN_11),
                                .I3(XLXN_11), .I4(XLXN_11), .I5(XLXN_11),
                                .O(XLXN_212));
   // synthesis attribute HU_SET of XLXI_7 is "XLXI_7_3"
   AND16_MXILINX_fanouts XLXI_17 (.I0(XLXN_249), .I1(XLXN_249), .I2(XLXN_249), .I3(XLXN_249),
                                  .I4(XLXN_249), .I5(XLXN_249), .I6(XLXN_249), .I7(XLXN_249),
                                  .I8(XLXN_249), .I9(XLXN_249), .I10(XLXN_249), .I11(XLXN_249),
                                  .I12(XLXN_249), .I13(XLXN_249), .I14(XLXN_249), .I15(XLXN_249),
                                  .O(XLXN_517));
   // synthesis attribute HU_SET of XLXI_17 is "XLXI_17_9"
   AND8_MXILINX_fanouts XLXI_21 (.I0(XLXN_115), .I1(XLXN_115), .I2(XLXN_115), .I3(XLXN_115),
                                 .I4(XLXN_115), .I5(XLXN_115), .I6(XLXN_115), .I7(XLXN_115),
                                 .O(XLXN_118));
   // synthesis attribute HU_SET of XLXI_21 is "XLXI_21_0"
   AND8_MXILINX_fanouts XLXI_22 (.I0(XLXN_112), .I1(XLXN_112), .I2(XLXN_112), .I3(XLXN_112),
                                 .I4(XLXN_112), .I5(XLXN_112), .I6(XLXN_112), .I7(XLXN_112),
                                 .O(XLXN_144));
   // synthesis attribute HU_SET of XLXI_22 is "XLXI_22_1"
   AND9_MXILINX_fanouts XLXI_23 (.I0(XLXN_152), .I1(XLXN_152), .I2(XLXN_152), .I3(XLXN_152),
                                 .I4(XLXN_152), .I5(XLXN_152), .I6(XLXN_152), .I7(XLXN_152),
                                 .I8(XLXN_152),
                                 .O(XLXN_483));
   // synthesis attribute HU_SET of XLXI_23 is "XLXI_23_2"
   AND9_MXILINX_fanouts XLXI_27 (.I0(XLXN_11), .I1(XLXN_11), .I2(XLXN_11), .I3(XLXN_11),
                                 .I4(XLXN_11), .I5(XLXN_11), .I6(XLXN_11), .I7(XLXN_11),
                                 .I8(XLXN_11),
                                 .O(XLXN_484));
   // synthesis attribute HU_SET of XLXI_27 is "XLXI_27_4"
   AND9_MXILINX_fanouts XLXI_28 (.I0(XLXN_258), .I1(XLXN_258), .I2(XLXN_258), .I3(XLXN_258),
                                 .I4(XLXN_258), .I5(XLXN_258), .I6(XLXN_258), .I7(XLXN_258),
                                 .I8(XLXN_258),
                                 .O(XLXN_486));
   // synthesis attribute HU_SET of XLXI_28 is "XLXI_28_5"
   AND7_MXILINX_fanouts XLXI_29 (.I0(XLXN_212), .I1(XLXN_212), .I2(XLXN_212), .I3(XLXN_212),
                                 .I4(XLXN_212), .I5(XLXN_212), .I6(XLXN_212),
                                 .O(XLXN_485));
   // synthesis attribute HU_SET of XLXI_29 is "XLXI_29_6"
   AND8_MXILINX_fanouts XLXI_30 (.I0(XLXN_212), .I1(XLXN_212), .I2(XLXN_212), .I3(XLXN_212),
                                 .I4(XLXN_212), .I5(XLXN_212), .I6(XLXN_212), .I7(XLXN_212),
                                 .O(XLXN_258));
   // synthesis attribute HU_SET of XLXI_30 is "XLXI_30_7"
   AND8_MXILINX_fanouts XLXI_39 (.I0(XLXN_258), .I1(XLXN_258), .I2(XLXN_258), .I3(XLXN_258),
                                 .I4(XLXN_258), .I5(XLXN_258), .I6(XLXN_258), .I7(XLXN_258),
                                 .O(XLXN_249));
   // synthesis attribute HU_SET of XLXI_39 is "XLXI_39_8"
   AND9_MXILINX_fanouts XLXI_42 (.I0(XLXN_249), .I1(XLXN_249), .I2(XLXN_249), .I3(XLXN_249),
                                 .I4(XLXN_249), .I5(XLXN_249), .I6(XLXN_249), .I7(XLXN_249),
                                 .I8(XLXN_249),
                                 .O(XLXN_487));
   // synthesis attribute HU_SET of XLXI_42 is "XLXI_42_10"
   AND16_MXILINX_fanouts XLXI_60 (.I0(XLXN_517), .I1(XLXN_517), .I2(XLXN_517), .I3(XLXN_517),
                                  .I4(XLXN_517), .I5(XLXN_517), .I6(XLXN_517), .I7(XLXN_517),
                                  .I8(XLXN_517), .I9(XLXN_517), .I10(XLXN_517), .I11(XLXN_517),
                                  .I12(XLXN_517), .I13(XLXN_517), .I14(XLXN_517), .I15(XLXN_517),
                                  .O(XLXN_521));
   // synthesis attribute HU_SET of XLXI_60 is "XLXI_60_11"
   AND16_MXILINX_fanouts XLXI_62 (.I0(XLXN_503), .I1(XLXN_503), .I2(XLXN_503), .I3(XLXN_503),
                                  .I4(XLXN_503), .I5(XLXN_503), .I6(XLXN_503), .I7(XLXN_503),
                                  .I8(XLXN_503), .I9(XLXN_503), .I10(XLXN_503), .I11(XLXN_503),
                                  .I12(XLXN_503), .I13(XLXN_503), .I14(XLXN_503), .I15(XLXN_503),
                                  .O(XLXN_522));
   // synthesis attribute HU_SET of XLXI_62 is "XLXI_62_16"
   AND16_MXILINX_fanouts XLXI_63 (.I0(XLXN_506), .I1(XLXN_506), .I2(XLXN_506), .I3(XLXN_506),
                                  .I4(XLXN_506), .I5(XLXN_506), .I6(XLXN_506), .I7(XLXN_506),
                                  .I8(XLXN_506), .I9(XLXN_506), .I10(XLXN_506), .I11(XLXN_506),
                                  .I12(XLXN_506), .I13(XLXN_506), .I14(XLXN_506), .I15(XLXN_506),
                                  .O(XLXN_523));
   // synthesis attribute HU_SET of XLXI_63 is "XLXI_63_18"

   AND16_MXILINX_fanouts XLXI_64 (.I0(XLXN_509), .I1(XLXN_509), .I2(XLXN_509), .I3(XLXN_509),
                                  .I4(XLXN_509), .I5(XLXN_509), .I6(XLXN_509), .I7(XLXN_509),
                                  .I8(XLXN_509), .I9(XLXN_509), .I10(XLXN_509), .I11(XLXN_509),
                                  .I12(XLXN_509), .I13(XLXN_509), .I14(XLXN_509), .I15(XLXN_509),
                                  .O(XLXN_524));
   // synthesis attribute HU_SET of XLXI_64 is "XLXI_64_12"
   AND16_MXILINX_fanouts XLXI_65 (.I0(XLXN_509), .I1(XLXN_509), .I2(XLXN_509), .I3(XLXN_509),
                                  .I4(XLXN_509), .I5(XLXN_509), .I6(XLXN_509), .I7(XLXN_509),
                                  .I8(XLXN_509), .I9(XLXN_509), .I10(XLXN_509), .I11(XLXN_509),
                                  .I12(XLXN_509), .I13(XLXN_509), .I14(XLXN_509), .I15(XLXN_509),
                                  .O(XLXN_525));
   // synthesis attribute HU_SET of XLXI_65 is "XLXI_65_13"
   AND16_MXILINX_fanouts XLXI_66 (.I0(XLXN_509), .I1(XLXN_509), .I2(XLXN_509), .I3(XLXN_509),
                                  .I4(XLXN_509), .I5(XLXN_509), .I6(XLXN_509), .I7(XLXN_509),
                                  .I8(XLXN_509), .I9(XLXN_509), .I10(XLXN_509), .I11(XLXN_509),
                                  .I12(XLXN_509), .I13(XLXN_509), .I14(XLXN_509), .I15(XLXN_509),
                                  .O(XLXN_603));
   // synthesis attribute HU_SET of XLXI_66 is "XLXI_66_14"
   AND12_MXILINX_fanouts XLXI_67 (.I0(XLXN_520), .I1(XLXN_520), .I2(XLXN_517), .I3(XLXN_517),
                                  .I4(XLXN_517), .I5(XLXN_517), .I6(XLXN_517), .I7(XLXN_517),
                                  .I8(XLXN_517), .I9(XLXN_517), .I10(XLXN_517), .I11(XLXN_517),
                                  .O(XLXN_503));
   // synthesis attribute HU_SET of XLXI_67 is "XLXI_67_15"
   AND12_MXILINX_fanouts XLXI_69 (.I0(XLXN_537), .I1(XLXN_503), .I2(XLXN_503), .I3(XLXN_503),
                                  .I4(XLXN_503), .I5(XLXN_503), .I6(XLXN_503), .I7(XLXN_503),
                                  .I8(XLXN_503), .I9(XLXN_503), .I10(XLXN_503), .I11(XLXN_503),
                                  .O(XLXN_506));
   // synthesis attribute HU_SET of XLXI_69 is "XLXI_69_17"
   AND12_MXILINX_fanouts XLXI_72 (.I0(XLXN_506), .I1(XLXN_506), .I2(XLXN_506), .I3(XLXN_506),
                                  .I4(XLXN_506), .I5(XLXN_506), .I6(XLXN_506), .I7(XLXN_506),
                                  .I8(XLXN_506), .I9(XLXN_506), .I10(XLXN_506), .I11(XLXN_506),
                                  .O(XLXN_509));
   // synthesis attribute HU_SET of XLXI_72 is "XLXI_72_19"
endmodule


FILE: bram2.vf

////////////////////////////////////////////////////////////////////////////////
// Copyright (c) 1995-2007 Xilinx, Inc.  All rights reserved.
////////////////////////////////////////////////////////////////////////////////
//
// Vendor: Xilinx
// Version : 9.2.02i
// Application : sch2verilog
// Filename : bram2.vf
// Timestamp : 11/11/2007 12:03:10
//
//Command: C:\Xilinx92i\bin\nt\sch2verilog.exe -intstyle ise -family virtex2 -w
"C:/Documents and Settings/HP Owner/My
Documents/schoolStuff/Thesis/VHDL/ThesisVHDLSims/bram2.sch" bram2.vf
//Design Name: bram2
//Device: virtex2
//Purpose:
//    This verilog netlist is translated from an ECS schematic. It can be
//    synthesized and simulated, but it should not be modified.
//
`timescale 1ns / 1ps

module M2_1_MXILINX_bram2(D0,
                          D1,
                          S0,
                          O);

   input D0;
   input D1;
   input S0;
   output O;

   wire M0;
   wire M1;

   AND2B1 I_36_7 (.I0(S0),
                  .I1(D0),
                  .O(M0));
   OR2 I_36_8 (.I0(M1),
               .I1(M0),
               .O(O));
   AND2 I_36_9 (.I0(D1),
                .I1(S0),
                .O(M1));
endmodule

`timescale 1ns / 1ps

module bram2(Add,
             CLK,
             D_out);

   input [14:0] Add;
   input CLK;
   output D_out;

   wire [0:0] XLXN_3;
   wire XLXN_6;
   wire XLXN_8;
   wire XLXN_9;
   wire XLXN_11;
   wire [0:0] XLXN_15;
   wire [0:0] XLXN_16;
   wire [0:0] XLXN_17;

   RAMB16_S1 XLXI_3 (.ADDR(Add[13:0]),
                     .CLK(CLK),
                     .DI(XLXN_3[0]),
                     .EN(XLXN_6),
                     .SSR(XLXN_11),
                     .WE(XLXN_11),
                     .DO(XLXN_16[0]));
   defparam XLXI_3.INIT = 1'h0;
   defparam XLXI_3.INIT_00 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_01 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_02 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_03 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;

- OMITTED PARTS of ROM initialization FOR BREVITY -

   defparam XLXI_3.INIT_39 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_3A =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_3B =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_3C =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_3D =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_3E =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.INIT_3F =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_3.SRVAL = 1'h0;
   defparam XLXI_3.WRITE_MODE = "WRITE_FIRST";

   RAMB16_S1 XLXI_4 (.ADDR(Add[13:0]),
                     .CLK(CLK),
                     .DI(XLXN_17[0]),
                     .EN(XLXN_8),
                     .SSR(XLXN_9),
                     .WE(XLXN_9),
                     .DO(XLXN_15[0]));
   defparam XLXI_4.INIT = 1'h0;
   defparam XLXI_4.INIT_00 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_4.INIT_01 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_4.INIT_02 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_4.INIT_03 =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_4.INIT_04 =

- OMITTED PARTS of ROM initialization FOR BREVITY -

      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_4.INIT_3F =
      256'h0000000000000000000000000000000000000000000000000000000000000000;
   defparam XLXI_4.SRVAL = 1'h0;
   defparam XLXI_4.WRITE_MODE = "WRITE_FIRST";

   M2_1_MXILINX_bram2 XLXI_5 (.D0(XLXN_15[0]),
                              .D1(XLXN_16[0]),
                              .S0(Add[14]),
                              .O(D_out));
   // synthesis attribute HU_SET of XLXI_5 is "XLXI_5_0"
   GND XLXI_6 (.G(XLXN_11));
   GND XLXI_7 (.G(XLXN_9));
   GND XLXI_8 (.G(XLXN_17[0]));
   GND XLXI_9 (.G(XLXN_3[0]));
   VCC XLXI_10 (.P(XLXN_6));
   VCC XLXI_11 (.P(XLXN_8));
endmodule
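
The addressing scheme of the bram2 netlist can be summarized with a small behavioural model. The Python sketch below is only an illustration and is not part of the generated netlist: the class name `Bram2Model` and its arguments are hypothetical, the clocked (synchronous) read of the block RAMs is ignored, and the bank contents are made-up test patterns. It shows Add[13:0] addressing both 16K x 1 block RAMs while Add[14] selects between them through the 2-to-1 multiplexer (D0 from XLXI_4, D1 from XLXI_3).

```python
class Bram2Model:
    """Behavioural stand-in for bram2: two 16K x 1 memories behind a 2-to-1 mux."""

    def __init__(self, low_bank, high_bank):
        # low_bank models XLXI_4 (mux input D0); high_bank models XLXI_3 (mux input D1).
        assert len(low_bank) == len(high_bank) == 1 << 14
        self.banks = (low_bank, high_bank)

    def read(self, add):
        """Return the bit selected by a 15-bit address (clocking is not modelled)."""
        select = (add >> 14) & 1   # Add[14] drives the mux select input S0
        offset = add & 0x3FFF      # Add[13:0] addresses both block RAMs
        return self.banks[select][offset]


# Hypothetical contents: all zeros in the low bank, all ones in the high bank.
rom = Bram2Model([0] * (1 << 14), [1] * (1 << 14))
print(rom.read(0x0000), rom.read(0x4000))
```

With these test patterns, any address below 0x4000 reads from the low bank and any address from 0x4000 to 0x7FFF reads from the high bank.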

FILE: ramtester.vf

////////////////////////////////////////////////////////////////////////////////
// Copyright (c) 1995-2007 Xilinx, Inc.  All rights reserved.
////////////////////////////////////////////////////////////////////////////////
//
// Vendor: Xilinx
// Version : 9.2.02i
// Application : sch2verilog
// Filename : ramtester.vf
// Timestamp : 11/11/2007 12:03:07
//
//Command: C:\Xilinx92i\bin\nt\sch2verilog.exe -intstyle ise -family virtex2 -w
"C:/Documents and Settings/HP Owner/My
Documents/schoolStuff/Thesis/VHDL/ThesisVHDLSims/ramtester.sch" ramtester.vf
//Design Name: ramtester
//Device: virtex2
//Purpose:
//    This verilog netlist is translated from an ECS schematic. It can be
//    synthesized and simulated, but it should not be modified.
//
`timescale 1ns / 1ps

module ramtester(XLXN_1,
                 XLXN_2,
                 XLXN_3,
                 XLXN_4,
                 XLXN_5,
                 XLXN_20,
                 XLXN_21,
                 XLXN_22,
                 XLXN_25,
                 XLXN_26,
                 XLXN_23);

   input XLXN_1;
   input XLXN_2;
   input XLXN_3;
   input XLXN_4;
   input XLXN_5;
   input XLXN_20;
   input XLXN_21;
   input XLXN_22;
   input XLXN_25;
   input XLXN_26;
   output XLXN_23;

   RAM128X1S XLXI_4 (.A0(XLXN_20),
                     .A1(XLXN_1),
                     .A2(XLXN_2),
                     .A3(XLXN_3),
                     .A4(XLXN_4),
                     .A5(XLXN_5),
                     .A6(XLXN_22),
                     .D(XLXN_26),
                     .WCLK(XLXN_21),
                     .WE(XLXN_25),
                     .O(XLXN_23));
   defparam XLXI_4.INIT = 128'h00000000000000000000000000000000;
endmodule


2.  Synthesis  Reports 

The synthesis reports were generated from the VHDL files above. They were
generated for the Xilinx Virtex-II XC2V6000 with package ff1517 and a speed grade
of -4. These reports were used to gather timing and hardware utilization parameters.
The key parts that were analyzed were the number of LUTs and slices and the worst-case
signal propagation path. The delay due to the IOBs was subtracted from the total delay at
the end of each synthesis report so that multiple components can be cascaded inside the
FPGA. Because the VHDL files were modified without changing their names, the name
of a synthesis report often does not reflect the actual size of the device. For example,
adder_64.syr, shown below, is the synthesis report for a 129-bit RCA.

Parts  of  the  reports  have  been  omitted  in  this  appendix  for  the  sake  of  brevity. 
The  first  synthesis  report  (for  adder_64.syr)  shows  almost  everything  that  is  included  in  a 
synthesis  report.  The  following  synthesis  reports  show  only  information  that  is  pertinent 
to  this  thesis. 
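
The IOB correction described above can be illustrated with a short Python sketch. It is an assumption that "the delay due to the IOBs" means the IBUF and OBUF gate delays only; the function name `internal_delay` is hypothetical, and the numbers are the ones reported in adder_64.syr (total 14.963 ns, IBUF 0.825 ns, OBUF 4.361 ns).

```python
def internal_delay(total_ns, ibuf_ns, obuf_ns):
    """Return the path delay with the IBUF and OBUF gate delays removed.

    Assumption: only the buffer gate delays count as IOB delay; the net
    delays attached to the buffers are left in the result.
    """
    return round(total_ns - ibuf_ns - obuf_ns, 3)


# Values taken from the adder_64.syr timing report.
adder_delay = internal_delay(14.963, 0.825, 4.361)
print(adder_delay)
```

If the net delays adjacent to the buffers were also treated as IOB delay, they would be subtracted in the same way.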


FILE: adder_64.syr
Release 6.3.03i - xst G.38

Copyright (c) 1995-2004 Xilinx, Inc.  All rights reserved.

--> Parameter TMPDIR set to projnav

CPU : 0.00 / 0.51 s | Elapsed : 0.00 / 0.00 s

--> Parameter xsthdpdir set to ./xst

CPU : 0.00 / 0.51 s | Elapsed : 0.00 / 0.00 s

--> Reading design: adder_64.prj

TABLE  OF  CONTENTS 

1)  Synthesis  Options  Summary 

2)  HDL  Compilation 

3)  HDL  Analysis 

4)  HDL  Synthesis 

5)  Advanced  HDL  Synthesis 

5.1)  HDL  Synthesis  Report 

6)  Low  Level  Synthesis 

7)  Final  Report 

7.1)  Device  utilization  summary 

7.2)  TIMING  REPORT 

* Synthesis Options Summary *

- Source Parameters
Input File Name                    : adder_64.prj
Input Format                       : mixed
Ignore Synthesis Constraint File   : NO
Verilog Include Directory          :

- Target Parameters
Output File Name                   : adder_64
Output Format                      : NGC
Target Device                      : xc2v6000-4-ff1517

- Source Options
Top Module Name                    : adder_64
Automatic FSM Extraction           : YES
FSM Encoding Algorithm             : Auto
FSM Style                          : lut
RAM Extraction                     : Yes
RAM Style                          : Auto
ROM Extraction                     : Yes
ROM Style                          : Auto
Mux Extraction                     : YES
Mux Style                          : Auto
Decoder Extraction                 : YES
Priority Encoder Extraction        : YES
Shift Register Extraction          : YES
Logical Shifter Extraction         : YES
XOR Collapsing                     : YES
Resource Sharing                   : YES
Multiplier Style                   : auto
Automatic Register Balancing       : No

- Target Options
Add IO Buffers                     : YES
Global Maximum Fanout              : 500
Add Generic Clock Buffer(BUFG)     : 16
Register Duplication               : YES
Equivalent register Removal        : YES
Slice Packing                      : YES
Pack IO Registers into IOBs        : auto

- General Options
Optimization Goal                  : Speed
Optimization Effort                : 1
Keep Hierarchy                     : NO
Global Optimization                : AllClockNets
RTL Output                         : Yes
Write Timing Constraints           : NO
Hierarchy Separator                :
Bus Delimiter                      : <>
Case Specifier                     : maintain
Slice Utilization Ratio            : 100
Slice Utilization Ratio Delta      : 5

- Other Options
lso                                : adder_64.lso
Read Cores                         : YES
cross_clock_analysis               : NO
verilog2001                        : YES
Optimize Instantiated Primitives   : NO
tristate2logic                     : No

* HDL Compilation *

Compiling vhdl file H:/Thesis/VHDL/ThesisVHDLSims/Adder_64.vhd in Library work.
Architecture behavioral of Entity adder_64 is up to date.

* HDL Analysis *

Analyzing Entity <adder_64> (Architecture <behavioral>).
Entity <adder_64> analyzed. Unit <adder_64> generated.

* HDL Synthesis *

Synthesizing Unit <adder_64>.
    Related source file is H:/Thesis/VHDL/ThesisVHDLSims/Adder_64.vhd.
    Found 129-bit adder for signal <sum>.
    Summary:
        inferred 1 Adder/Subtracter(s).
Unit <adder_64> synthesized.

* Advanced HDL Synthesis *

Advanced RAM inference ...
Advanced multiplier inference ...
Advanced Registered AddSub inference ...
Dynamic shift register inference ...

HDL Synthesis Report

Macro Statistics
# Adders/Subtractors               : 1
  129-bit adder                    : 1

* Low Level Synthesis *

Optimizing unit <adder_64> ...
Loading device for application Xst from file '2v6000.nph' in environment C:/Xilinx.
Mapping all equations...
Building and optimizing final netlist ...
Found area constraint ratio of 100 (+ 5) on block adder_64, actual ratio is 0.


* Final Report *

Final Results
RTL Top Level Output File Name     : adder_64.ngr
Top Level Output File Name         : adder_64
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 387

Macro Statistics :
# Adders/Subtractors               : 1
#      129-bit adder               : 1

Cell Usage :
# BELS                             : 386
#      GND                         : 1
#      LUT2                        : 129
#      MUXCY                       : 128
#      XORCY                       : 128
# IO Buffers                       : 387
#      IBUF                        : 258
#      OBUF                        : 129

Device utilization summary:

Selected Device : 2v6000ff1517-4

 Number of Slices:                  65  out of  33792     0%
 Number of 4 input LUTs:           129  out of  67584     0%
 Number of bonded IOBs:            387  out of   1104    35%


TIMING  REPORT 

NOTE:  THESE  TIMING  NUMBERS  ARE  ONLY  A  SYNTHESIS  ESTIMATE. 

FOR  ACCURATE  TIMING  INFORMATION  PLEASE  REFER  TO  THE  TRACE  REPORT 
GENERATED  AFTER  PLACE-and-ROUTE . 

Clock  Information: 

No  clock  signals  found  in  this  design 
Timing  Summary: 


Speed  Grade:  -4 

Minimum  period:  No  path  found 

Minimum  input  arrival  time  before  clock:  No  path  found 
Maximum  output  required  time  after  clock:  No  path  found 
Maximum  combinational  path  delay:  14.963ns 

Timing  Detail: 


All  values  displayed  in  nanoseconds  (ns) 


Timing  constraint:  Default  path  analysis 

Delay:  14.963ns  (Levels  of  Logic  =  132) 

Source:  a<0>  (PAD) 

Destination:  sum<128>  (PAD) 

Data  Path:  a<0>  to  sum<128> 

                             Gate     Net
Cell:in->out      fanout    Delay   Delay  Logical Name (Net Name)

IBUF:I->O              1    0.825   0.517  a_0_IBUF (a_0_IBUF)
LUT2:I0->O             2    0.439   0.000  adder_64_sum<0>lut (sum_0_OBUF)
MUXCY:S->O             1    0.298   0.000  adder_64_sum<0>cy (adder_64_sum<0>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<1>cy (adder_64_sum<1>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<2>cy (adder_64_sum<2>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<3>cy (adder_64_sum<3>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<4>cy

- PARTS OMITTED FOR BREVITY -

                                           (adder_64_sum<113>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<118>cy (adder_64_sum<118>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<119>cy (adder_64_sum<119>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<120>cy (adder_64_sum<120>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<121>cy (adder_64_sum<121>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<122>cy (adder_64_sum<122>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<123>cy (adder_64_sum<123>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<124>cy (adder_64_sum<124>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<125>cy (adder_64_sum<125>_cyo)
MUXCY:CI->O            1    0.053   0.000  adder_64_sum<126>cy (adder_64_sum<126>_cyo)
MUXCY:CI->O            0    0.053   0.000  adder_64_sum<127>cy (adder_64_sum<127>_cyo)
XORCY:CI->O            1    1.274   0.517  adder_64_sum<128>_xor (sum_128_OBUF)
OBUF:I->O                   4.361          sum_128_OBUF (sum<128>)

Total                      14.963ns (13.928ns logic, 1.035ns route)
                                    (93.1% logic, 6.9% route)
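
The logic delay and the number of logic levels reported for this path can be reproduced from the per-cell gate delays. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes that an adder of another width would use the same structure as this 129-bit report (one LUT2, one MUXCY entered through S, a chain of MUXCY cells entered through CI, and a final XORCY, bracketed by IBUF and OBUF).

```python
# Gate delays in ns, read from the adder_64.syr data path.
IBUF, LUT2, MUXCY_S, MUXCY_CI, XORCY, OBUF = 0.825, 0.439, 0.298, 0.053, 1.274, 4.361


def rca_logic_delay(n_bits):
    """Logic delay from a<0> to the top sum bit of an n-bit ripple-carry adder.

    Assumption: the carry chain has (n_bits - 1) MUXCY cells, the first entered
    through S and the remaining (n_bits - 2) entered through CI.
    """
    carry_stages = n_bits - 2
    return round(IBUF + LUT2 + MUXCY_S + carry_stages * MUXCY_CI + XORCY + OBUF, 3)


def levels_of_logic(n_bits):
    """IBUF + LUT2 + (n_bits - 1) MUXCY cells + XORCY + OBUF."""
    return n_bits + 3


print(rca_logic_delay(129), levels_of_logic(129))
```

For the 129-bit adder this matches the report: 13.928 ns of logic delay over 132 levels of logic. The device utilization figures are consistent in the same way, since 129 LUTs at two 4-input LUTs per Virtex-II slice occupy 65 slices.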


CPU  :  18.95  /  19.98  s  |  Elapsed  :  19.00  /  20.00  s 


Total  memory  usage  is  144088  kilobytes 


FILE: fanouts.syr
Release 6.3.03i - xst G.38

Copyright (c) 1995-2004 Xilinx, Inc.  All rights reserved.

- PARTS OMITTED FOR BREVITY -

Input File Name                    : fanouts.prj

- PARTS OMITTED FOR BREVITY -

Cell Usage :
# BELS                             : 125
#      AND2                        : 5
#      AND3                        : 9
#      AND4                        : 57
#      GND                         : 19
#      MUXCY                       : 7
#      MUXCY_L                     : 21
#      VCC                         : 7
# IO Buffers                       : 16
#      IBUF                        : 3
#      OBUF                        : 13
# Others                           : 68
#      FMAP                        : 68

Device utilization summary:

Selected Device : 2v6000ff1517-4

 Number of Slices:                  14  out of  33792     0%
 Number of bonded IOBs:             16  out of   1104     1%

TIMING REPORT

- PARTS OMITTED FOR BREVITY -

Data Path: XLXN_115 to XLXN_524

                             Gate     Net
Cell:in->out      fanout    Delay   Delay  Logical Name (Net Name)

IBUF:I->O             10    0.825   0.885  XLXN_115_IBUF (XLXN_115_IBUF)
AND2:I1->O            11    0.439   0.909  XLXI_3 (XLXN_112)
AND3:I2->O            13    0.439   0.955  XLXI_4 (XLXN_152)
AND4:I3->O            15    0.439   0.989  XLXI_5 (XLXN_11)
begin scope: 'XLXI_7'
AND3:I2->O             1    0.439   0.517  I_36_69 (I35)
AND4:I3->O            15    0.439   0.989  I_36_85 (O)
end scope: 'XLXI_7'
begin scope: 'XLXI_30'
AND4:I3->O             1    0.439   0.517  I_36_127 (S1)
AND2:I1->O            17    0.439   1.012  I_36_142 (O)
end scope: 'XLXI_30'
begin scope: 'XLXI_39'
AND4:I3->O             1    0.439   0.517  I_36_127 (S1)
AND2:I1->O            25    0.439   1.069  I_36_142 (O)
end scope: 'XLXI_39'
begin scope: 'XLXI_17'
AND4:I3->O             1    0.439   0.000  I_36_110 (S0)
MUXCY_L:S->LO          1    0.298   0.000  I_36_2 (C0)
MUXCY_L:CI->LO         1    0.053   0.000  I_36_129 (C1)
MUXCY_L:CI->LO         1    0.053   0.000  I_36_147 (C2)
MUXCY:CI->O           26    0.942   1.072  I_36_165 (O)
end scope: 'XLXI_17'
begin scope: 'XLXI_67'
AND4:I1->O             1    0.439   0.517  I_36_151 (S2)
AND3:I2->O            27    0.439   1.075  I_36_177 (O)
end scope: 'XLXI_67'
begin scope: 'XLXI_69'
AND4:I1->O             1    0.439   0.517  I_36_151 (S2)
AND3:I2->O            28    0.439   1.077  I_36_177 (O)
end scope: 'XLXI_69'
begin scope: 'XLXI_72'
AND4:I1->O             1    0.439   0.517  I_36_151 (S2)
AND3:I2->O            48    0.439   1.129  I_36_177 (O)
end scope: 'XLXI_72'
begin scope: 'XLXI_64'
AND4:I3->O             1    0.439   0.000  I_36_110 (S0)
MUXCY_L:S->LO          1    0.298   0.000  I_36_2 (C0)
MUXCY_L:CI->LO         1    0.053   0.000  I_36_129 (C1)
MUXCY_L:CI->LO         1    0.053   0.000  I_36_147 (C2)
MUXCY:CI->O            1    0.942   0.517  I_36_165 (O)
end scope: 'XLXI_64'
OBUF:I->O                   4.361          XLXN_524_OBUF (XLXN_524)

Total                      30.125ns (15.341ns logic, 14.784ns route)
                                    (50.9% logic, 49.1% route)

CPU : 6.50 / 7.51 s | Elapsed : 7.00 / 8.00 s
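
The fanouts.syr data path pairs each fan-out with an estimated net delay, so the relationship can be tabulated directly. The Python sketch below is a hypothetical helper rather than the estimation method used in this thesis: it collects the (fan-out, net delay) pairs from the report and linearly interpolates between them. Nets routed on the dedicated carry chain (0.000 ns) are left out, and linear interpolation is an assumption made only for illustration.

```python
# (fan-out, net delay in ns) pairs read from the fanouts.syr data path.
FANOUT_NET_DELAY = [
    (1, 0.517), (10, 0.885), (11, 0.909), (13, 0.955), (15, 0.989),
    (17, 1.012), (25, 1.069), (26, 1.072), (27, 1.075), (28, 1.077),
    (48, 1.129),
]


def interpolate_net_delay(fanout):
    """Linearly interpolate the reported net delay for a fan-out in [1, 48]."""
    points = FANOUT_NET_DELAY
    if not points[0][0] <= fanout <= points[-1][0]:
        raise ValueError("fan-out outside the tabulated range")
    for (f0, d0), (f1, d1) in zip(points, points[1:]):
        if f0 <= fanout <= f1:
            return round(d0 + (d1 - d0) * (fanout - f0) / (f1 - f0), 3)


print(interpolate_net_delay(12))
```

The tabulated delays increase monotonically with fan-out, and the increase flattens at higher fan-outs (from 0.885 ns at a fan-out of 10 to 1.129 ns at a fan-out of 48).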

FILE: BRAM2.syr
Release 6.3.03i - xst G.38

Copyright (c) 1995-2004 Xilinx, Inc.  All rights reserved.

- PARTS OMITTED FOR BREVITY -

Input File Name                    : bram2.prj

- PARTS OMITTED FOR BREVITY -

HDL Synthesis Report

- PARTS OMITTED FOR BREVITY -

* Final Report *

Final Results
RTL Top Level Output File Name     : bram2.ngr
Top Level Output File Name         : bram2
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 17

Cell Usage :
# BELS                             : 9
#      AND2                        : 1
#      AND2b1                      : 1
#      GND                         : 4
#      OR2                         : 1
#      VCC                         : 2
# RAMS                             : 2
#      RAMB16_S1                   : 2
# Clock Buffers                    : 1
#      BUFGP                       : 1
# IO Buffers                       : 16
#      IBUF                        : 15
#      OBUF                        : 1

Device utilization summary:

Selected Device : 2v6000ff1517-4

 Number of bonded IOBs:             16  out of   1104     1%
 Number of BRAMs:                    2  out of    144     1%
 Number of GCLKs:                    1  out of     16     6%

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
      FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
      GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

Clock Signal        | Clock buffer(FF name)  | Load
CLK                 | BUFGP                  | 2

- PARTS OMITTED FOR BREVITY -

Data Path: XLXI_3 to D_out

                             Gate     Net
Cell:in->out      fanout    Delay   Delay  Logical Name (Net Name)

RAMB16_S1:CLK->DO0     1    2.599   0.517  XLXI_3 (XLXN_16)
begin scope: 'XLXI_5'
AND2:I0->O             1    0.439   0.517  I_36_9 (M1)
OR2:I0->O              1    0.439   0.517  I_36_8 (O)
end scope: 'XLXI_5'
OBUF:I->O                   4.361          D_out_OBUF (D_out)

Total                       9.391ns (7.838ns logic, 1.552ns route)
                                    (83.5% logic, 16.5% route)

Timing constraint: Default path analysis
  Delay:               7.800ns (Levels of Logic = 5)
  Source:              Add<14> (PAD)
  Destination:         D_out (PAD)

Data Path: Add<14> to D_out

                             Gate     Net
Cell:in->out      fanout    Delay   Delay  Logical Name (Net Name)

IBUF:I->O              2    0.825   0.701  Add_14_IBUF (Add_14_IBUF)
begin scope: 'XLXI_5'
AND2b1:I0->O           1    0.439   0.517  I_36_7 (M0)
OR2:I1->O              1    0.439   0.517  I_36_8 (O)
end scope: 'XLXI_5'
OBUF:I->O                   4.361          D_out_OBUF (D_out)

Total                       7.800ns (6.064ns logic, 1.736ns route)
                                    (77.7% logic, 22.3% route)

CPU : 7.44 / 8.47 s | Elapsed : 7.00 / 8.00 s

- PARTS OMITTED FOR BREVITY -


FILE: multiplier.syr

Release 6.3.03i - xst G.38

Copyright (c) 1995-2004 Xilinx, Inc.  All rights reserved.

- PARTS OMITTED FOR BREVITY -

Input File Name                    : multiplier.prj

- PARTS OMITTED FOR BREVITY -

* Final Report *

Final Results
RTL Top Level Output File Name     : multiplier.ngr
Top Level Output File Name         : multiplier
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

FILE: mux128to1.syr

Final Results
RTL Top Level Output File Name     : mux128to1.ngr
Top Level Output File Name         : mux128to1
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 137

Cell Usage :
# BELS                             : 349
#      AND2                        : 64
#      AND2b1                      : 64
#      AND3                        : 14
#      AND3b1                      : 14
#      LUT1                        : 66
#      MUXF5                       : 1
#      MUXF5_L                     : 32
#      MUXF6                       : 16
#      OR2                         : 78
# IO Buffers                       : 137
#      IBUF                        : 136
#      OBUF                        : 1

Device utilization summary:

Selected Device : 2v6000ff1517-4

 Number of Slices:                  33  out of  33792     0%
 Number of 4 input LUTs:            66  out of  67584     0%
 Number of bonded IOBs:            137  out of   1104    12%

TIMING REPORT

- PARTS OMITTED FOR BREVITY -

Timing Detail:

All values displayed in nanoseconds (ns)

Timing constraint: Default path analysis
  Delay:               17.386ns (Levels of Logic = 21)
  Source:              Sel<0> (PAD)
  Destination:         XLXN_20 (PAD)

Data Path: Sel<0> to XLXN_20

                             Gate     Net
Cell:in->out      fanout    Delay   Delay  Logical Name (Net Name)

IBUF:I->O            128    0.825   1.316  Sel_0_IBUF (Sel_0_IBUF)
begin scope: 'XLXI_3_XLXI_2'
begin scope: 'I_MAB'
AND2b1:I0->O           1    0.439   0.517  I_36_7 (M0)
OR2:I1->O              1    0.439   0.517  I_36_8 (O)
end scope: 'I_MAB'
LUT1:I0->O             1    0.439   0.000  MAB_rt (MAB_rt)
MUXF5_L:I1->LO         1    0.436   0.000  I_M8B (M8B)
MUXF6:I0->O            1    0.447   0.517  I_M8F (MBF)
begin scope: 'I_O'
AND3:I0->O             1    0.439   0.517  I_36_30 (M1)
OR2:I0->O              1    0.439   0.517  [illegible]
end scope: 'I_O'
end scope: 'XLXI_3_XLXI_2'
begin scope: 'XLXI_3_XLXI_4'
AND3:I0->O             1    0.439   0.517  I_36_30 (M1)
OR2:I0->O              1    0.439   0.517  [illegible]
end scope: 'XLXI_3_XLXI_4'
begin scope: 'XLXI_6'
begin scope: 'I_M01'
AND3:I0->O             1    0.439   0.517  I_36_30 (M1)
OR2:I0->O              1    0.439   0.517  [illegible]
end scope: 'I_M01'
LUT1:I0->O             1    0.439   0.000  M01_rt (M01_rt)
MUXF5:I0->O            1    0.436   0.517  [illegible]
end scope: 'XLXI_6'
OBUF:I->O                   4.361          XLXN_20_OBUF (XLXN_20)

Total                      17.386ns (10.895ns logic, 6.491ns route)
                                    (62.7% logic, 37.3% route)

CPU : 7.42 / 8.44 s | Elapsed : 7.00 / 8.00 s


PARTS  OMITTED  FOR  BREVITY 
FILE: ramtester.syr

Release 6.3.03i - xst G.38
Copyright (c) 1995-2004 Xilinx, Inc.  All rights reserved.

- PARTS OMITTED FOR BREVITY -

Input File Name : ramtester.prj

- PARTS OMITTED FOR BREVITY -

Final Report

Final Results
RTL Top Level Output File Name     : ramtester.ngr
Top Level Output File Name         : ramtester
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO


Design  Statistics 

#  IOs  :  11 


Cell  Usage  : 

#  RAMS  :  1 

#  RAM128X1S  :  1 

#  Clock  Buffers  :  1 

#  BUFGP  :  1 

#  IO  Buffers  :  10

#  IBUF  :  9 

#  OBUF  :  1 


Device utilization summary:

Selected Device : 2v6000ff1517-4

 Number of Slices:                       4  out of  33792     0%
 Number of bonded IOBs:                 10  out of   1104     0%
 Number of GCLKs:                        1  out of     16     6%

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
      FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
      GENERATED AFTER PLACE-and-ROUTE.

Clock Information:
-----------------------------------+------------------------+-------+
Clock Signal                       | Clock buffer(FF name)  | Load  |
-----------------------------------+------------------------+-------+
XLXN_21                            | BUFGP                  | 1     |
-----------------------------------+------------------------+-------+

Timing Summary:

- PARTS OMITTED FOR BREVITY -

  Data Path: XLXN_26 to XLXI_4
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------------------
    IBUF:I->O              1   0.825   0.517  XLXN_26_IBUF (XLXN_26_IBUF)
    RAM128X1S:D                0.727          XLXI_4
    ----------------------------------------
    Total                      2.069ns (1.552ns logic, 0.517ns route)
                                       (75.0% logic, 25.0% route)

Timing constraint: Default OFFSET OUT AFTER for Clock 'XLXN_21'
  Offset:              7.682ns (Levels of Logic = 1)
  Source:              XLXI_4 (RAM)
  Destination:         XLXN_23 (PAD)
  Source Clock:        XLXN_21 rising

  Data Path: XLXI_4 to XLXN_23
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------------------
    RAM128X1S:WCLK->O      1   2.804   0.517  XLXI_4 (XLXN_23_OBUF)
    OBUF:I->O                  4.361          XLXN_23_OBUF (XLXN_23)
    ----------------------------------------
    Total                      7.682ns (7.165ns logic, 0.517ns route)
                                       (93.3% logic, 6.7% route)

Timing constraint: Default path analysis
  Delay:               8.583ns (Levels of Logic = 3)
  Source:              XLXN_20 (PAD)
  Destination:         XLXN_23 (PAD)

  Data Path: XLXN_20 to XLXN_23
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------------------
    IBUF:I->O             16   0.825   1.000  XLXN_20_IBUF (XLXN_20_IBUF)
    RAM128X1S:A0->O        1   1.879   0.517  XLXI_4 (XLXN_23_OBUF)
    OBUF:I->O                  4.361          XLXN_23_OBUF (XLXN_23)
    ----------------------------------------
    Total                      8.583ns (7.065ns logic, 1.518ns route)
                                       (82.3% logic, 17.7% route)

CPU : 5.50 / 6.51 s | Elapsed : 6.00 / 7.00 s

- PARTS OMITTED FOR BREVITY -

B.2 COLLECTED DATA TEXT FILES

The following data were collected from synthesis reports and placed into the text files named below. A circuit was synthesized for each value of n listed.

Column groups, left to right: NetDelay.txt, MuxDelayWithNet.txt, AdderDelayWithNet.txt, MultDelayWithNet.txt, MultSlices.txt. Each file is an independent two-column list; the rows are placed side by side only for layout.

  n  Delay (ns) |   n  Delay (ns) |   n  Delay (ns) |   n  Delay (ns) |   n      Slices
  1       0.517 |   2       0.517 |   1       1.474 |   2       4.766 |   1           0
  2       0.701 |   3      4.0527 |   2       1.658 |   4       5.595 |   2           0
  3       0.725 |   4      4.0527 |   3       2.638 |   6       6.423 |   3           0
  4       0.747 |   5      4.5917 |   4       3.617 |   8       7.251 |   4           0
  5       0.771 |   7      4.5917 |   5       3.205 |  12       8.906 |   5           0
  6       0.794 |   8      4.5917 |   6       3.258 |  16      10.562 |   6           0
  7       0.817 |   9      6.6657 |   7       3.311 |  17      10.977 |   7           0
  8        0.84 |  15      6.6657 |   8       3.364 |  18      16.218 |   8           0
  9       0.863 |  16      6.6657 |   9       3.417 |  19      16.424 |  12           0
 10       0.885 |  17      8.6657 |  10        3.47 |  20       16.43 |  16           0
 11       0.909 |  31      8.6657 |  11       3.523 |  20       16.43 |  17           0
 12       0.931 |  32      8.6657 |  12       3.576 |  21      16.536 |  18          19
 13       0.955 |  33     10.6617 |  13       3.629 |  24      16.854 |  19          22
 15       0.989 |  63     10.6617 |  14       3.682 |  32      17.702 |  20          24
 16           1 |  64     10.6617 |  15       3.735 |  34      17.914 |  20          24
 17       1.012 |  65     12.1997 |  16       3.788 |  35      20.518 |  21          26
 18       1.024 | 127     12.1997 |  20           4 |  36      20.624 |  24          32
 19       1.035 | 128     12.1997 |  23       4.159 |  37       20.73 |  32          48
 20       1.041 |                 |  24       4.212 |  51      22.214 |  34          52
 21       1.046 |                 |  25       4.265 |  52      22.343 |  35          89
 22       1.052 |                 |  28       4.424 |  53      22.449 |  36          93
 23       1.058 |                 |  32       4.636 |  54      22.555 |  37          97
 24       1.064 |                 |  33       4.689 |  55      22.661 |  51         146
 25       1.069 |                 |  64       6.332 |  64      23.615 |  52         193
 26       1.072 |                 | 128       9.724 |  68      24.039 |  53         197
 27       1.075 |                 | 129       9.777 |  69      26.644 |  54         203
 28       1.077 |                 |                 |  70       26.75 |  55         206
 32       1.088 |                 |                 |  85       28.34 |  64         248
 48       1.129 |                 |                 |  86      28.469 |  68         266
 63       1.168 |                 |                 |  87      28.575 |  69         348
 64       1.171 |                 |                 | 102      30.165 |  70         353
 65       1.173 |                 |                 | 103      30.294 |  85         445
 79       1.209 |                 |                 | 104        30.4 |  86         520
 80       1.212 |                 |                 | 119       31.99 |  87         525
 81       1.215 |                 |                 | 120      32.119 | 102         633
127       1.316 |                 |                 | 121      32.225 | 103         734
128       1.316 |                 |                 | 128      32.967 | 104         740
129       1.316 |                 |                 | 136      33.815 | 119         863
                |                 |                 | 137       36.42 | 120         974
                |                 |                 | 138      36.526 | 121         980
                |                 |                 |                 | 128        1047
                |                 |                 |                 | 136        1119
                |                 |                 |                 | 137        1299
                |                 |                 |                 | 138        1306

B.3 ESTIMATION OF MISSING DATAPOINTS


The following plots show how fillLin estimates the missing data points in the five
sets of collected data points. The values returned from fillLin are used in HUandDelay to
estimate component complexity and delay.
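
As an illustration of the kind of estimate described above, the following Python sketch fills in values of n that were not synthesized by drawing a straight line between the neighboring measured points. It assumes that fillLin interpolates linearly, as its name suggests; the thesis code is not reproduced here, and the name `fill_lin` and its dictionary interface are hypothetical. The sample points are taken from NetDelay.txt.

```python
def fill_lin(points):
    """Return {n: value} for every integer n between the smallest and
    largest measured n, interpolating linearly across any gaps.

    `points` maps each measured n to its measured value
    (a hypothetical interface, not the thesis's own).
    """
    ns = sorted(points)
    filled = {}
    for lo, hi in zip(ns, ns[1:]):
        span = hi - lo
        for n in range(lo, hi):
            # Blend linearly between the two bracketing measurements.
            frac = (n - lo) / span
            filled[n] = points[lo] + frac * (points[hi] - points[lo])
    # The last measured point is copied through unchanged.
    filled[ns[-1]] = points[ns[-1]]
    return filled


# Measured net delays (ns) around the gap at n = 14, from NetDelay.txt.
net_delay = {13: 0.955, 15: 0.989, 16: 1.0}
estimates = fill_lin(net_delay)
print(round(estimates[14], 3))  # halfway between the n = 13 and n = 15 values
```

Measured points pass through unchanged, so only the gaps are estimated.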


[Figure: FillLin data points for NetDelay.txt]
[Figure: FillLin data points for AdderDelayWithNet.txt]
[Figure: FillLin data points for MuxDelayWithNet.txt; horizontal axis: Mux Size (bits)]
[Figure: FillLin data points for MultDelayWithNet.txt]
[Figure: FillLin data points for MultSlices.txt]

APPENDIX C. COMMONLY USED VARIABLES

C.1 VARIABLE DEFINITIONS

The following is a list of the variables used in this thesis and their descriptions.

Variable                   | Definition(s)                                                          | How determined
$\varepsilon$              | Maximum allowable error                                                | Defined by system; here $\varepsilon$ is a power of 2 (exponent illegible in source)
$\sigma_{\min}$            | Minimum segment width                                                  | [formula illegible in source; it depends on $\varepsilon$ and has separate cases for linear (2) and quadratic (3)]
$c_{2i}, c_{1i}, c_{0i}$   | Coefficient values for the approximation equation for the i-th segment | Determined by segmentation algorithms
$i$                        | Segment index number                                                   | SIE or part of x determines i
$k$                        | Number of address lines to the coefficient table of an NFG             | $k = \lceil \log_2 s_{\min} \rceil$
$n$                        | 1. Number of bits in x; 2. Bus-width for a given NFG                   | Defined by NFG requirements
$s$                        | Number of segments to be used in an NFG                                | $s = 2^{\lceil \log_2 s_{\min} \rceil} = 2^k$
$s_{\min}$                 | Minimum number of segments required for an NFG                         | From segmentation algorithms or by segments.m
SRR                        | Segment Reduction Ratio                                                | $\mathrm{SRR} = s_{\min}^{\text{non-unif}} / s_{\min}^{\text{unif}}$
$t_{prop}$                 | Combinational propagation delay through a logic device                 | Using models or HUandDelay.m
$x_{\max,i}$               | Maximum value of x in segment i                                        | From segmentation algorithms
$x_{\min,i}$               | Minimum value of x in segment i                                        | From segmentation algorithms
$y$                        | Approximation function, linear or quadratic                            | Defined by NFG architecture

Table 13 Variable Definitions.
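
The relations among s_min, k, s, and SRR in Table 13 can be checked with a short Python sketch. Only the formulas k = ceil(log2(s_min)), s = 2^k, and SRR = s_min(non-uniform) / s_min(uniform) are taken from the table; the function names and the sample segment counts are hypothetical and serve only as an illustration.

```python
def address_lines(s_min: int) -> int:
    """k = ceil(log2(s_min)): address lines to the coefficient table."""
    # For integers s_min >= 1, (s_min - 1).bit_length() equals
    # ceil(log2(s_min)) without any floating-point rounding.
    return (s_min - 1).bit_length()


def segments_used(s_min: int) -> int:
    """s = 2**k: s_min rounded up to the next power of two."""
    return 2 ** address_lines(s_min)


def segment_reduction_ratio(s_min_nonunif: int, s_min_unif: int) -> float:
    """SRR: non-uniform minimum segment count over the uniform one."""
    return s_min_nonunif / s_min_unif


# Hypothetical example: 5 segments are required, so 3 address lines
# select among 8 table entries.
print(address_lines(5), segments_used(5))
print(segment_reduction_ratio(3, 12))
```

An SRR below 1 means that non-uniform segmentation needs fewer segments than uniform segmentation for the same function and error.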

C.2 COMMON VARIABLE VALUES

The following is a list of parameters used throughout this thesis. These values were extracted from empirical evidence and/or product specification sheets [18]. For each parameter, the value with more significant digits was used in all calculations.

Parameter              | Description                                                                                             | From Simulation | From [18]
$t_{MUXCY,S \to O}$    | Propagation delay from the select line of MUXCY to the output                                           | 0.298 ns        | Note 2
$t_{MUXCY,I0 \to O}$   | Propagation delay from either input (I0 or I1) of MUXCY to the output                                   | 0.053 ns        | 0.05 ns
$t_{ORCY}$             | Referred to as $t_{SOPSOP}$ [18]; the propagation delay through the fast SOP OR gate, ORCY              | 0.439 ns        | 0.44 ns
$t_{LUT,4}$            | Referred to as $t_{ILO}$ [18]; propagation delay through a 4-input LUT                                  | 0.439 ns        | 0.44 ns
$t_{LUT,5}$            | Referred to as $t_{IF5}$ [18]; propagation delay through a 5-input LUT                                  | Note 1          | 0.72 ns

1. No simulation data for this value.
2. Value is not found in reference.

Table 14 Common Variable Values.

APPENDIX D. MODEL DATA

D.1 COMPLEXITY AND DELAY FOR BASIC AND COMPACT NFGS FOR THE FUNCTIONS IN THE FUNCTION SUITE

$f(x) = 2^x$ on $[0,1]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=2^x on the interval [0,1]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=2^x on the interval [0,1]; horizontal axis: n (bits)]

$f(x) = 1/x$ on $[1,2]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=1/x on the interval [1,2]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=1/x on the interval [1,2]; horizontal axis: n (bits)]

$f(x) = \sqrt{x}$ on $[1,2]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=sqrt(x) on the interval [1,2]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=sqrt(x) on the interval [1,2]; horizontal axis: n (bits)]

$f(x) = 1/\sqrt{x}$ on $[1,2]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=1/sqrt(x) on the interval [1,2]; legend: LUB, LNB, QUB, QNB; horizontal axis: n (bits), 5 to 50]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=1/sqrt(x) on the interval [1,2]; horizontal axis: n (bits)]

$f(x) = \log_2 x$ on $[1,2]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=log2(x) on the interval [1,2]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=log2(x) on the interval [1,2]; legend includes LUC, QUC, QNC; horizontal axis: n (bits)]

$f(x) = \ln(x)$ on $[1,2]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=log(x) on the interval [1,2]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=log(x) on the interval [1,2]; horizontal axis: n (bits)]

$f(x) = \sin \pi x$ on $[0,0.5]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=sin(pi*x) on the interval [0,0.5]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=sin(pi*x) on the interval [0,0.5]; horizontal axis: n (bits)]

$f(x) = \cos \pi x$ on $[0,0.5]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=cos(pi*x) on the interval [0,0.5]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=cos(pi*x) on the interval [0,0.5]; horizontal axis: n (bits)]

$f(x) = \tan \pi x$ on $[0,0.25]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=tan(pi*x) on the interval [0,0.25]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=tan(pi*x) on the interval [0,0.25]; horizontal axis: n (bits)]

$f(x) = \sqrt{-\ln x}$ on $[1/512, 1/4]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=sqrt(-log(x)) on the interval [0.0019531,0.25]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=sqrt(-log(x)) on the interval [0.0019531,0.25]; horizontal axis: n (bits)]

$f(x) = \tan^2 \pi x + 1$ on $[0,0.25]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=(tan(pi*x))^2+1 on the interval [0,0.25]; horizontal axis: n (bits), 5 to 50]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=(tan(pi*x))^2+1 on the interval [0,0.25]; horizontal axis: n (bits)]

$f(x) = -\left(x \log_2 x + (1-x)\log_2(1-x)\right)$ on $[1/256, 1-1/256]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=0-(x*log2(x)+(1-x)*log2(1-x)) on the interval [0.0039063,0.99609]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=0-(x*log2(x)+(1-x)*log2(1-x)) on the interval [0.0039063,0.99609]; horizontal axis: n (bits)]

$f(x) = \frac{1}{1+e^{-x}}$ on $[0,1]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=1/(1+exp(-x)) on the interval [0,1]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=1/(1+exp(-x)) on the interval [0,1]; horizontal axis: n (bits)]

$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ on $[0, \sqrt{2}]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=1/(sqrt(2*pi))*exp(-x^2/2) on the interval [0,1.4142]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=1/(sqrt(2*pi))*exp(-x^2/2) on the interval [0,1.4142]; horizontal axis: n (bits)]

$f(x) = \sin(e^x)$ on $[0,2]$

[Plots, Basic Architectures: Basic NFGs realizing f(x)=sin(exp(x)) on the interval [0,2]; horizontal axis: n (bits)]
[Plots, Compact Architectures: Compact NFGs realizing f(x)=sin(exp(x)) on the interval [0,2]; horizontal axis: n (bits)]

D.2 THE BEST BASIC ARCHITECTURES FOR EACH FUNCTION

1. Based on Smallest HUP

[Table: Best Basic NFG based on HUP (1=LUB, 2=LNB, 3=QUB, 4=QNB); the table body is not recoverable from the source]

2. Based on Shortest Delay

Best Basic NFG based on Delay (1=LUB, 2=LNB, 3=QUB, 4=QNB)

Each row lists the entries in source order for function numbers 1 to 15. A "?" marks an entry that is unreadable in the source. Rows with fewer than 15 entries have lost cells, so their entries may not line up with the function numbers. Only the tail of the row preceding n = 32 and the rows for n = 32 to 64 survive.

n        Best architecture by function number
(lost)   1 1 1 3 1 3 1 1 3
32-35    1 1 1 1 1 1 1 1 1 3 1 3 1 1 3
36       1 1 1 1 1 1 1 1 1 ? 3 3 1 1 ?
37       1 1 1 1 1 1 1 1 1 ? ? ? 1 1 ?
38       1 1 1 1 1 1 3 3 1 1
39       1 3 1 1 3 1 3 3 ? ? ? 1 1 ?
40       3 3 1 3 3 3 3 3 ? ? ? 1 3 ?
41-42    3 3 1 3 3 3 3 3 ? 3 3 3 1 3 3
43       3 3 1 3 3 3 3 3 ? 3 3 1 3 ?
44       3 3 3 3 3 3 3 3 3 3 1 3
45       3 3 3 3 3 3 3 3 3 3 3 3 ?
46       3 3 3 3 3 3 3 3 ? 3 3 3 1 3 3
47-64    3 3 3 3 3 3 3 3 ? 3 3 3 3 3 3

D.3 THE BEST COMPACT ARCHITECTURES FOR EACH FUNCTION VERSUS SIZE

1. Based on Smallest HUP

Best Compact NFG based on HUP (1=LUB, 2=LNB, 3=QUB, 4=QNB)

Each row lists the entries in source order for function numbers 1 to 15. A "?" marks an entry that is unreadable in the source. Rows with fewer than 15 entries have lost cells, so their entries may not line up with the function numbers.

n        Best architecture by function number
1        1 1 1 1 1 1 1 1 1 2 1 1 1
2        1 1 1 1 1 1 2 ? ? 1 1 1 1 1
3        1 1 1 1 1 1 1 1 1 1 2 1 1 1
4        1 1 1 2 1 2 1 1 1 1 1 1 2 1
5        1 1 1 1 1 1 1 1 1 1 ? 1 1 1
6        1 1 1 1 1 1 1 1 1 1 1 1 2
7        1 1 1 1 1 1 1 1 1 1 ? 1 1 1
8        1 1 1 1 1 1 1 1 1 1 1 1 1 1
9        1 1 1 1 1 1 1 1 1 1 2 1 1 1
10       1 1 1 1 1 1 1 1 1 1 1 1 1 1
11       1 1 1 1 1 1 1 1 1 1 2 1 1 1
12       1 1 1 1 1 1 1 1 1 ? 1 1 1 1 1
13-14    1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
15       1 1 1 1 1 1 1 1 1 4 1 4 1 1 1
16-18    1 1 1 1 1 1 1 1 1 4 1 4 1 1 3
19       1 1 1 1 1 1 1 1 1 4 1 3 1 1 ?
20       1 1 1 1 1 1 1 1 1 4 3 ? 1 1 ?
21       1 1 1 1 1 1 1 1 1 4 ? 4 1 1 ?
22       1 1 1 1 1 1 3 ? ? 4 ? 4 1 1 ?
23       3 3 1 3 3 3 3 4 3 1 3
24       3 3 3 3 3 3 3 ? ? 4 ? 4 1 3
25       3 3 3 3 3 3 3 ? ? 4 4 3 3
26       3 3 3 3 3 3 3 ? ? 4 ? 3 3 3
27       3 3 3 3 3 3 3 ? ? 4 ? 4 3 3
28-64    3 3 3 3 3 3 3 ? ? 4 3 4 3 3 3

2. Based on Shortest Delay

Best Compact NFG based on Delay (1=LUB, 2=LNB, 3=QUB, 4=QNB)

Each row lists the entries in source order for function numbers 1 to 15. A "?" marks an entry that is unreadable in the source. Rows with fewer than 15 entries have lost cells, so their entries may not line up with the function numbers.

n        Best architecture by function number
1        1 1 1 1 1 1 1 1 ? ? ? ? ? ? ?
2-3      1 1 1 1 1 1 1 1 ?
4-6      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
7        1 1 1 1 1 1 1 1 ? ? ? ? ? ?
8        1 1 1 1 1 1 1 1 ?
9-10     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11       1 1 1 1 1 1 1 1 ? ? ? ? ? ? ?
12       1 1 1 1 1 1 1 1
13       1 1 1 1 1 1 1 1 1 1 1 1 1 1
14       1 1 1 1 1 1 1 ? ?
15-16    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
17       1 1 1 1 1 1 1 ? ? ? ? ? ? ? ?
18-19    1 1 1 1 1 1 1
20       1 1 1 1 1 1 1 1 1 1 1 1 1 1
21       1 1 1 1 1 1 1 1 ? ? ? ? ? ? ?
22-24    1 1 1 1 1 1 1 1 1 3 1 1 1 1 1
25       1 1 1 1 1 1 1 1 1 ? 1 3 1 1 1
26       1 1 1 1 1 1 1 1 1 ? 1 ? 1 1 3
27-29    1 1 1 1 1 1 1 1 1 3 1 3 1 1 3
30       1 1 1 1 1 1 1 1 1 1 3 ? 1 1 ?
31       1 1 1 1 1 1 1 1 1 1 3 3 1 1
32       1 1 1 1 1 1 3 ? ? 3 ? ? 1 1 ?
33       1 3 1 1 3 1 3 1 1 1 1 ?
34       3 3 1 3 3 3 3 3 1 1 3 3
35       3 3 1 3 3 3 3 ? ? ? ? 3 1 3 3
36       3 3 3 3 3 3 3 ? ? ? ? 1 1 3 3
37       3 3 3 3 3 3 3 ? ? 1 ? 1 3 3 3
38       3 3 3 3 3 3 3 ? ? 3 ? 1 3 3 3
39       3 3 3 3 3 3 3 ? ? 1 ? 1 3 3 3
40       3 3 3 3 3 3 3 ? ? 3 3 1 3 3 3
41       3 3 3 3 3 3 3 ? ? 3 3 3 3 3 3
42       3 3 3 3 3 3 3 ? ? 3 3 1 3 3 3
43-44    3 3 3 3 3 3 3 ? ? 3 3 3 3 3 3
45       3 3 3 3 3 3 3 ? 3 3 3 3 3 3
46-64    3 3 3 3 3 3 3 ? ? 3 3 3 3 3 3

D.4 PERCENT HUP AND DELAY DUE TO SIE FOR LNB AND QNB NFGS

Basic architectures for $f(x) = 2^x$ on $[0,1]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=2^x on the interval [0,1]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=2^x on the interval [0,1]]

Basic architectures for $f(x) = 1/x$ on $[1,2]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=1/x on the interval [1,2]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=1/x on the interval [1,2]]

Basic architectures for $f(x) = \sqrt{x}$ on $[1,2]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=sqrt(x) on the interval [1,2]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=sqrt(x) on the interval [1,2]]

Basic architectures for $f(x) = 1/\sqrt{x}$ on $[1,2]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=1/sqrt(x) on the interval [1,2]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=1/sqrt(x) on the interval [1,2]]

Basic architectures for $f(x) = \log_2 x$ on $[1,2]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=log2(x) on the interval [1,2]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=log2(x) on the interval [1,2]]

Basic architectures for $f(x) = \ln(x)$ on $[1,2]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=log(x) on the interval [1,2]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=log(x) on the interval [1,2]]

Basic architectures for $f(x) = \sin \pi x$ on $[0,0.5]$

[Plot, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x)=sin(pi*x) on the interval [0,0.5]]
[Plot, Combinational Delay: t_SIE/t_NFG for f(x)=sin(pi*x) on the interval [0,0.5]]

Basic architectures for f(x) = cos(pi*x) on [0,0.5]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = cos(pi*x) on the interval [0,0.5]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = cos(pi*x) on the interval [0,0.5]]


Basic architectures for f(x) = tan(pi*x) on [0,0.25]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = tan(pi*x) on the interval [0,0.25]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = tan(pi*x) on the interval [0,0.25]]
Basic architectures for f(x) = sqrt(-ln(x)) on [1/512,1/4]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = sqrt(-log(x)) on the interval [0.0019531,0.25]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = sqrt(-log(x)) on the interval [0.0019531,0.25]]


Basic architectures for f(x) = tan^2(pi*x) + 1 on [0,0.25]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = (tan(pi*x))^2 + 1 on the interval [0,0.25]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = (tan(pi*x))^2 + 1 on the interval [0,0.25]]


Basic architectures for f(x) = -(x*log2(x) + (1-x)*log2(1-x)) on [1/256,1-1/256]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = 0-(x*log2(x)+(1-x)*log2(1-x)) on the interval [0.0039063,0.99609]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = 0-(x*log2(x)+(1-x)*log2(1-x)) on the interval [0.0039063,0.99609]]
Basic architectures for f(x) = 1/(1 + e^(-x)) on [0,1]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = 1/(1+exp(-x)) on the interval [0,1]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = 1/(1+exp(-x)) on the interval [0,1]]
Basic architectures for f(x) = (1/sqrt(2*pi)) * e^(-x^2/2) on [0,sqrt(2)]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = 1/(sqrt(2*pi))*exp(-x^2/2) on the interval [0,1.4142]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = 1/(sqrt(2*pi))*exp(-x^2/2) on the interval [0,1.4142]]


Basic architectures for f(x) = sin(e^x) on [0,2]
[Figure, Hardware Utilization Percentage: HUP_SIE/HUP_NFG for f(x) = sin(exp(x)) on the interval [0,2]]
[Figure, Combinational Delay: t_SIE/t_NFG for f(x) = sin(exp(x)) on the interval [0,2]]